Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:
In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.
# Load raw BRFSS 2020 data
brfss_full <- read_xpt("C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/LLCP2020XPT/LLCP2020.XPT") %>%
janitor::clean_names()
# Select variables of interest
brfss_slr <- brfss_full %>%
select(bmi5, age80, sex, educag, genhlth, physhlth, sleptim1)
# Recode variables
brfss_slr <- brfss_slr %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
education = factor(case_when(
educag == 1 ~ "< High school",
educag == 2 ~ "High school graduate",
educag == 3 ~ "Some college",
educag == 4 ~ "College graduate"
), levels = c("< High school", "High school graduate",
"Some college", "College graduate")),
gen_health_num = ifelse(genhlth %in% 1:5, genhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_slr <- brfss_slr %>%
select(bmi, age, sex, education, gen_health_num, sleep_hrs, phys_days) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
# Save analytic dataset
saveRDS(brfss_slr, here::here(
"C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Simple Linear Regression/LLCP2020XPT_clean.rds"
))
tibble(
Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_slr), ncol(brfss_slr))
) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 3000 |
| Variables | 7 |
brfss_slr %>%
select(bmi, age, sleep_hrs, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| bmi | age | sleep_hrs | phys_days | |
|---|---|---|---|---|
| Min. :14.63 | Min. :18.00 | Min. : 1.000 | Min. : 1.00 | |
| 1st Qu.:24.32 | 1st Qu.:43.00 | 1st Qu.: 6.000 | 1st Qu.: 2.00 | |
| Median :27.89 | Median :58.00 | Median : 7.000 | Median : 6.00 | |
| Mean :29.18 | Mean :55.52 | Mean : 6.915 | Mean :11.66 | |
| 3rd Qu.:32.89 | 3rd Qu.:70.00 | 3rd Qu.: 8.000 | 3rd Qu.:20.00 | |
| Max. :59.60 | Max. :80.00 | Max. :20.000 | Max. :30.00 |
brfss_slr %>%
select(bmi, age, sleep_hrs, sex, education) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m2)",
age ~ "Age (years)",
sleep_hrs ~ "Sleep (hours/night)",
sex ~ "Sex",
education ~ "Education"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")| Characteristic | N | N = 3,0001 |
|---|---|---|
| BMI (kg/m2) | 3,000 | 29.2 (7.0) |
| Age (years) | 3,000 | 55.5 (17.4) |
| Sleep (hours/night) | 3,000 | 6.9 (1.7) |
| Sex | 3,000 | |
| Female | 1,701 (57%) | |
| Male | 1,299 (43%) | |
| Education | 3,000 | |
| < High school | 237 (7.9%) | |
| High school graduate | 796 (27%) | |
| Some college | 937 (31%) | |
| College graduate | 1,030 (34%) | |
| 1 Mean (SD); n (%) | ||
Simple linear regression models the mean of a continuous outcome \(Y\) as a linear function of a single predictor \(X\):
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n\]
Where:
| Symbol | Name | Meaning |
|---|---|---|
| \(Y_i\) | Response / Outcome | Observed value for subject \(i\) (e.g., BMI) |
| \(X_i\) | Predictor / Covariate | Observed predictor for subject \(i\) (e.g., age) |
| \(\beta_0\) | Intercept | Expected value of \(Y\) when \(X = 0\) |
| \(\beta_1\) | Slope | Expected change in \(Y\) for a 1-unit increase in \(X\) |
| \(\varepsilon_i\) | Error term | Random deviation of \(Y_i\) from the regression line |
The population regression line (also called the true or theoretical regression line) describes the expected (mean) value of \(Y\) at each value of \(X\):
\[E(Y \mid X) = \mu_{Y|X} = \beta_0 + \beta_1 X\]
| Population | Sample | |
|---|---|---|
| Line | \(\beta_0 + \beta_1 X\) | \(\hat{y} = b_0 + b_1 X\) |
| Intercept | \(\beta_0\) (parameter) | \(b_0\) or \(\hat{\beta}_0\) (estimate) |
| Slope | \(\beta_1\) (parameter) | \(b_1\) or \(\hat{\beta}_1\) (estimate) |
| Error | \(\varepsilon_i\) | \(e_i = Y_i - \hat{Y}_i\) (residual) |
We use our sample to estimate the population parameters. The estimates \(b_0\) and \(b_1\) define the fitted regression line.
Before fitting any model, always visualize the bivariate relationship.
p_scatter <- ggplot(brfss_slr, aes(x = age, y = bmi)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "BMI vs. Age (BRFSS 2020)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)BMI vs. Age — BRFSS 2020
Interpretation tip: The LOESS smoother (orange) follows the data without assuming linearity. When it closely tracks the linear fit (red), a linear model is reasonable. Departures suggest nonlinearity.
A useful mnemonic is LINE:
| Letter | Assumption | Description |
|---|---|---|
| L | Linearity | The relationship between \(X\) and \(E(Y)\) is linear |
| I | Independence | Observations are independent of one another |
| N | Normality | Errors \(\varepsilon_i\) are normally distributed |
| E | Equal variance | Errors have constant variance (homoscedasticity): \(\text{Var}(\varepsilon_i) = \sigma^2\) |
Formally, we assume:
\[\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
This means that for any value of \(X\), the distribution of \(Y\) is:
\[Y \mid X \sim N(\beta_0 + \beta_1 X, \; \sigma^2)\]
Note on independence: In cross-sectional survey data like BRFSS, observations from the same household or geographic cluster may not be fully independent. We acknowledge this limitation but proceed with the standard SLR framework for pedagogical purposes.
We estimate \(\beta_0\) and \(\beta_1\) by finding the values \(b_0\) and \(b_1\) that minimize the sum of squared residuals (SSR):
\[SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2\]
This is called the Ordinary Least Squares (OLS) criterion. Minimizing SSR yields the closed-form solutions:
\[b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]
\[b_0 = \bar{Y} - b_1 \bar{X}\]
where \(\bar{X}\) and \(\bar{Y}\) are the sample means of \(X\) and \(Y\).
Gauss-Markov Theorem: Under the LINE assumptions, OLS estimators are the Best Linear Unbiased Estimators (BLUE) — they have the smallest variance among all linear unbiased estimators.
# Fit simple linear regression: BMI ~ Age
model_slr <- lm(bmi ~ age, data = brfss_slr)
# Summary output
summary(model_slr)##
## Call:
## lm(formula = bmi ~ age, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.633 -4.883 -1.325 3.688 30.340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.528231 0.427507 69.071 <2e-16 ***
## age -0.006238 0.007347 -0.849 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.012 on 2998 degrees of freedom
## Multiple R-squared: 0.0002404, Adjusted R-squared: -9.312e-05
## F-statistic: 0.7208 on 1 and 2998 DF, p-value: 0.396
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: BMI ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 29.5282 | 0.4275 | 69.0708 | 0.000 | 28.6900 | 30.3665 |
| age | -0.0062 | 0.0073 | -0.8490 | 0.396 | -0.0206 | 0.0082 |
Fitted regression equation:
\[\widehat{\text{BMI}} = 29.528 + -0.0062 \times \text{Age}\]
Intercept (\(b_0 = 29.528\)): The estimated mean BMI when age = 0. This is a mathematical artifact — a newborn does not have an adult BMI. The intercept is not directly interpretable in this context, but is necessary to anchor the line.
Slope (\(b_1 = -0.0062\)): For each 1-year increase in age, BMI is estimated to decrease by 0.0062 kg/m², on average, holding all else constant (though there are no other variables in this simple model).
Practical significance vs. statistical significance: Even a small slope can be highly statistically significant with a large sample. Always consider whether the magnitude is meaningful in the real world.
# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)
# Show a sample of fitted values and residuals
augmented %>%
select(bmi, age, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed BMI (Y)", "Age (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Observed BMI (Y) | Age (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 26.58 | 67 | 29.110 | -2.530 |
| 33.47 | 38 | 29.291 | 4.179 |
| 35.15 | 78 | 29.042 | 6.108 |
| 30.42 | 65 | 29.123 | 1.297 |
| 22.67 | 55 | 29.185 | -6.515 |
| 30.11 | 80 | 29.029 | 1.081 |
| 35.43 | 34 | 29.316 | 6.114 |
| 31.58 | 71 | 29.085 | 2.495 |
| 28.13 | 55 | 29.185 | -1.055 |
| 34.01 | 62 | 29.141 | 4.869 |
The fitted value (Y^) and the observed BMI (Y) are different because a person’s age does not perfectly determine their exact BMI.
Fitted Value (Y^): This is the predicted average BMI for anyone of a specific age, based on the trend line calculated by the model. For example, the model might calculate that the average 67-year-old has a BMI of 29.110.
Observed BMI (Y): This is the actual, real-world BMI of a specific individual in the dataset.
The Residual/Error (e=Y−Y^): The difference between the two is the residual or error term. The error term (εi) represents the “random deviation of Yi from the regression line.”
# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)
p_resid <- ggplot(resid_sample, aes(x = age, y = bmi)) +
geom_segment(aes(xend = age, yend = .fitted),
color = "tomato", alpha = 0.5, linewidth = 0.5) +
geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
labs(
title = "Residuals Illustrated on the Regression Line",
subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_residVisualizing Residuals on the Regression Line
The total variability in \(Y\) can be decomposed into two parts:
\[\underbrace{SS_{Total}}_{Total\ variability} = \underbrace{SS_{Regression}}_{Explained\ by\ X} + \underbrace{SS_{Residual}}_{Unexplained}\]
Where:
\[SS_{Total} = \sum(Y_i - \bar{Y})^2 \qquad (df = n-1)\] \[SS_{Regression} = \sum(\hat{Y}_i - \bar{Y})^2 \qquad (df = 1)\] \[SS_{Residual} = \sum(Y_i - \hat{Y}_i)^2 \qquad (df = n-2)\]
In statistics, degrees of freedom refer to the number of independent pieces of information that are free to vary when calculating an estimate. You can think of it as the amount of “data budget” you have. Every time you estimate a parameter (like a mean or a slope), you spend one degree of freedom.
Why it’s \(n - 1\): You have \(n\) total observations. However, to calculate this variance, you first had to calculate the sample mean. Because the sum of deviations from the mean must always equal zero, knowing \(n-1\) of the deviations automatically tells you the last one. You “spent” 1 degree of freedom calculating the mean, leaving you with \(n - 1\).
Why it’s \(1\): In Simple Linear Regression, you are using exactly one predictor variable (e.g., Age) to explain the outcome (e.g., BMI). Because you are only estimating one slope parameter (\(\beta_1\)) to capture this relationship, the regression model has 1 degree of freedom. (Note: In multiple regression, this would be equal to \(k\), the number of predictor variables).
Why it’s \(n - 2\): To calculate the predicted values (\(\hat{Y}\)) and find the residuals, your model had to estimate two parameters from the data: the intercept (\(\beta_0\)) and the slope (\(\beta_1\)). Since you “spent” 2 pieces of information to create the regression line, you are left with \(n - 2\) degrees of freedom for the errors.
# ANOVA decomposition
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: BMI ~ Age",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| age | 1 | 35.438 | 35.438 | 0.721 | 0.396 |
| Residuals | 2998 | 147400.214 | 49.166 | NA | NA |
The Mean Squared Error estimates the variance of the error term:
\[MSE = \frac{SS_{Residual}}{n - 2} = \hat{\sigma}^2\]
The Residual Standard Error \(\hat{\sigma} = \sqrt{MSE}\) is in the same units as \(Y\) and tells us the typical prediction error of the model.
n <- nrow(brfss_slr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
Units = c("", "", "kg/m²")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value | Units |
|---|---|---|
| SS Residual | 147400.210 | |
| MSE (σ̂²) |
49.16
|
Interpretation: On average, our model’s predictions are off by about 7.01 BMI units.
\(R^2\) measures the proportion of total variability in \(Y\) explained by the linear regression on \(X\):
\[R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\]
\(R^2\) ranges from 0 to 1:
# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 2e-04 |
| Adjusted R² | -1e-04 |
| Variance Explained | 0.02% |
For simple linear regression:
\[R^2 = r^2\]
where \(r\) is the Pearson correlation coefficient between \(X\) and \(Y\).
r_pearson <- cor(brfss_slr$age, brfss_slr$bmi)
tibble(
Quantity = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
Value = c(
round(r_pearson, 4),
round(r_pearson^2, 4),
round(r_sq, 4),
as.character(round(r_pearson^2, 4) == round(r_sq, 4))
)
) %>%
kable(caption = "Pearson r vs. R² from Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Pearson r | -0.0155 |
| r² (from Pearson) | 2e-04 |
| R² (from model) | 2e-04 |
| r² = R²? | TRUE |
Important caveat: A low \(R^2\) does not mean the regression is useless. In epidemiology, outcomes are influenced by many unmeasured factors, so \(R^2\) values of 0.05–0.20 can still yield scientifically meaningful and statistically significant estimates.
The most important hypothesis test in SLR is:
\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between X and Y)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]
Test statistic:
\[t = \frac{b_1 - 0}{SE(b_1)} \sim t_{n-2} \quad \text{under } H_0\]
Where:
\[SE(b_1) = \frac{\hat{\sigma}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{\hat{\sigma}}{\sqrt{S_{XX}}}\]
# Extract slope test statistics
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "age")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Slope (b₁) | -0.0062 |
| SE(b₁) | 0.0073 |
| t-statistic | -0.849 |
| Degrees of freedom | 2998 |
| p-value | 0.396 |
| 95% CI Lower | -0.0206 |
| 95% CI Upper | 0.0082 |
Decision: With p = 0.396, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is statistically significant evidence of a linear association between age and BMI.
The F-test evaluates whether the overall model (i.e., all predictors together) explains a statistically significant portion of the variability in \(Y\). For simple linear regression with one predictor, the F-test is equivalent to the t-test for the slope (\(F = t^2\)).
\[F = \frac{MS_{Regression}}{MS_{Residual}} \sim F_{1,\, n-2} \quad \text{under } H_0\]
The Numerator: This represents the amount of variability in the data that your model successfully explains.
The Denominator: This represents the “leftover” or unexplained variability (the errors).
When your model does a good job of predicting the outcome, the explained variance goes up, and the unexplained variance goes down. This makes the resulting F-statistic larger. What a higher F-statistic means for your results:
Lower p-value: A larger F-statistic pushes the p-value closer to zero.
Statistical Significance: If the p-value drops below your alpha level (usually 0.05), you can reject the null hypothesis. It gives you confidence that your overall model actually has some predictive power, rather than just capturing random noise.
f_stat <- summary(model_slr)$fstatistic
f_value <- f_stat[1]
df1 <- f_stat[2]
df2 <- f_stat[3]
p_f <- pf(f_value, df1, df2, lower.tail = FALSE)
tibble(
Quantity = c("F-statistic", "df (numerator)", "df (denominator)",
"p-value", "Verification: t²", "Verification: F"),
Value = c(
round(f_value, 3),
df1,
df2,
format.pval(p_f, digits = 3),
round(slope_test$statistic^2, 3),
round(f_value, 3)
)
) %>%
kable(caption = "F-Test for Overall Model Significance") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| F-statistic | 0.721 |
| df (numerator) | 1 |
| df (denominator) | 2998 |
| p-value | 0.396 |
| Verification: t² | 0.721 |
| Verification: F | 0.721 |
A 95% CI for \(\beta_1\) is:
\[b_1 \pm t_{n-2, \, 0.025} \times SE(b_1)\]
t_crit <- qt(0.975, df = n - 2)
ci_lower <- slope_test$estimate - t_crit * slope_test$std.error
ci_upper <- slope_test$estimate + t_crit * slope_test$std.error
tibble(
Bound = c("95% CI Lower", "95% CI Upper"),
Value = c(round(ci_lower, 4), round(ci_upper, 4)),
Units = c("kg/m² per year", "kg/m² per year")
) %>%
kable(caption = "95% Confidence Interval for β₁ (manually computed)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Bound | Value | Units |
|---|---|---|
| 95% CI Lower | -0.0206 | kg/m² per year |
| 95% CI Upper | 0.0082 | kg/m² per year |
A confidence interval for the mean response \(E(Y \mid X = x^*)\) gives a range of plausible values for the population mean of \(Y\) at a specific value \(x^*\):
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE(\hat{Y}^*)\]
Where:
\[SE(\hat{Y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
A prediction interval gives a range for a single new observation \(Y^*_{new}\) at \(X = x^*\). It is always wider than the confidence interval because it accounts for both the uncertainty in \(E(Y)\) and the individual variability around the mean:
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE_{pred}\]
Where:
\[SE_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
# Compute CI and PI at specific age values
new_ages <- data.frame(age = c(25, 35, 45, 55, 65, 75))
ci_pred <- predict(model_slr, newdata = new_ages, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
pi_pred <- predict(model_slr, newdata = new_ages, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_ages, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
col.names = c("Age", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| Age | Fitted BMI | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 29.37 | 28.87 | 29.88 | 15.61 | 43.13 |
| 35 | 29.31 | 28.92 | 29.70 | 15.56 | 43.06 |
| 45 | 29.25 | 28.95 | 29.54 | 15.50 | 43.00 |
| 55 | 29.19 | 28.93 | 29.44 | 15.43 | 42.94 |
| 65 | 29.12 | 28.84 | 29.41 | 15.37 | 42.87 |
| 75 | 29.06 | 28.68 | 29.44 | 15.31 | 42.81 |
# Generate CI and PI across the full age range
age_grid <- data.frame(age = seq(18, 80, length.out = 200))
ci_band <- predict(model_slr, newdata = age_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(age_grid)
pi_band <- predict(model_slr, newdata = age_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(age_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_slr, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = age, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: BMI ~ Age",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "Age (years)",
y = "BMI (kg/m²)",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_piRegression Line with 95% Confidence and Prediction Intervals
Key distinction: If you want to estimate the average BMI for all 45-year-olds in the population, use the confidence interval. If you want to predict the BMI of a specific new 45-year-old patient, use the prediction interval.
Fitting a regression model is not enough — we must verify that the LINE assumptions are reasonably met. We do this through residual diagnostics.
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)Standard Regression Diagnostic Plots
Interpreting each plot:
1. Residuals vs. Fitted: Checks linearity and equal variance. We want a horizontal red line and random scatter with no pattern. A “fan shape” (spread increasing with fitted values) indicates heteroscedasticity.
2. Normal Q-Q Plot: Checks normality of residuals. Points should fall approximately along the 45° reference line. Heavy tails or S-curves suggest non-normality.
3. Scale-Location (Spread-Location): Another check for equal variance (homoscedasticity). The square root of standardized residuals is plotted against fitted values. A flat line indicates constant variance.
4. Residuals vs. Leverage: Identifies influential observations using Cook’s distance. Points in the upper or lower right corner (beyond the dashed lines) have high influence.
The Residuals vs. Predictor plot is a visual diagnostic tool used to check if your data meets the core assumptions of Simple Linear Regression—specifically Linearity and Equal Variance (Homoscedasticity).
p_resid_x <- ggplot(augmented, aes(x = age, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Age",
subtitle = "Should show no pattern — random scatter around zero",
x = "Age (years)",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_xResiduals vs. Age — Checking Linearity
Y-axis (Residuals): The errors from your model. This is how far off each person’s actual BMI was from the model’s predicted BMI.
X-axis (Age): Your predictor variable.
(Note: In Simple Linear Regression with only one predictor, this plot looks identical in shape to the “Residuals vs. Fitted” plot).
The Red Line: This is a flat, horizontal line at exactly zero. If the model predicted a person’s BMI perfectly, their point would land exactly on this line.
The Orange Line (LOESS Smoother): This is a moving average of the residuals. It helps your eye track the overall trend of the errors across different ages.
The orange line stays fairly close to the red line, though there is a very slight curve downwards at the extreme ends of the age range. The vertical spread of the points looks relatively consistent across the ages. Overall, it doesn’t show any severe violations of linearity or equal variance, though there is a lot of random scatter (which aligns with the very low \(R^2\) value we saw earlier).
p_hist <- ggplot(augmented, aes(x = .resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(augmented$.resid),
sd = sd(augmented$.resid)),
color = "black", linetype = "dashed", linewidth = 1) +
labs(
title = "Distribution of Residuals",
subtitle = "Red = kernel density | Black dashed = normal distribution",
x = "Residuals",
y = "Density"
) +
theme_minimal(base_size = 13)
p_histDistribution of Residuals
# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qqNormal Q-Q Plot of Residuals
# Cook's distance
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksCook’s Distance: Identifying Influential Observations
To reinforce the concepts, let’s fit a second SLR model examining the association between hours of sleep and BMI.
p_sleep <- ggplot(brfss_slr, aes(x = sleep_hrs, y = bmi)) +
geom_jitter(alpha = 0.15, color = "purple", width = 0.15, height = 0) +
geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
labs(
title = "BMI vs. Nightly Sleep Hours (BRFSS 2020)",
x = "Average Hours of Sleep per Night",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_sleepBMI vs. Sleep Hours
model_sleep <- lm(bmi ~ sleep_hrs, data = brfss_slr)
tidy(model_sleep, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "SLR: BMI ~ Hours of Sleep per Night",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 30.7419 | 0.534 | 57.5683 | 0.0000 | 29.6948 | 31.7890 |
| sleep_hrs | -0.2256 | 0.075 | -3.0087 | 0.0026 | -0.3726 | -0.0786 |
b1_sleep <- coef(model_sleep)["sleep_hrs"]
r2_sleep <- summary(model_sleep)$r.squared
tibble(
Metric = c("Slope (b₁)", "R²"),
Value = c(round(b1_sleep, 4), round(r2_sleep, 4))
) %>%
kable(caption = "Sleep Model Key Statistics") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Slope (b₁) | -0.2256 |
| R² | 0.0030 |
Interpretation: Each additional hour of sleep per night is associated with a change of -0.2256 kg/m² in BMI, on average. The direction of this association is negative (more sleep → lower BMI). The model explains 0.3% of variability in BMI. While statistically significant, the effect size is modest, underscoring the multifactorial nature of BMI.
par(mfrow = c(2, 2))
plot(model_sleep, which = 1:4,
col = adjustcolor("purple", 0.4), pch = 19, cex = 0.6)Our linear model estimated a negative slope for age: older adults have, on average, slightly lower BMI. But is that the full story? Cross-sectional data can show a decline at older ages due to survivorship bias — people with very high BMI may die before reaching old age, leaving a healthier-looking older sample. There may also be a genuine nonlinear pattern (BMI rises through middle age, then declines in later life).
We can test this by including an age² term in the model:
\[\widehat{\text{BMI}} = b_0 + b_1 \cdot \text{Age} + b_2 \cdot \text{Age}^2\]
This is still a linear regression model (linear in the coefficients), even though it is nonlinear in the predictor. It allows the slope to change across the range of age.
# Add age-squared term
brfss_slr <- brfss_slr %>%
mutate(age2 = age^2)
# Fit quadratic model
model_quad <- lm(bmi ~ age + age2, data = brfss_slr)
tidy(model_quad, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 5))) %>%
kable(
caption = "Quadratic Model: BMI ~ Age + Age²",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 18.54178 | 1.08095 | 17.15329 | 0 | 16.42230 | 20.66125 |
| age | 0.47435 | 0.04418 | 10.73772 | 0 | 0.38773 | 0.56096 |
| age2 | -0.00464 | 0.00042 | -11.02651 | 0 | -0.00546 | -0.00381 |
# Compare linear vs. quadratic model
tibble(
Model = c("Linear: BMI ~ Age", "Quadratic: BMI ~ Age + Age²"),
R_squared = c(
round(summary(model_slr)$r.squared, 4),
round(summary(model_quad)$r.squared, 4)
),
Adj_R2 = c(
round(summary(model_slr)$adj.r.squared, 4),
round(summary(model_quad)$adj.r.squared, 4)
),
AIC = c(round(AIC(model_slr), 1), round(AIC(model_quad), 1))
) %>%
kable(
caption = "Model Comparison: Linear vs. Quadratic",
col.names = c("Model", "R²", "Adj. R²", "AIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.min(c(AIC(model_slr), AIC(model_quad))),
bold = TRUE, background = "#d4edda")| Model | R² | Adj. R² | AIC |
|---|---|---|---|
| Linear: BMI ~ Age | 0.0002 | -0.0001 | 20203.2 |
| Quadratic: BMI ~ Age + Age² | 0.0392 | 0.0386 | 20085.9 |
# Generate predicted values from both models
age_seq <- data.frame(age = seq(18, 80, length.out = 300)) %>%
mutate(age2 = age^2)
pred_linear <- predict(model_slr, newdata = age_seq)
pred_quad <- predict(model_quad, newdata = age_seq)
pred_df <- age_seq %>%
mutate(
linear = pred_linear,
quadratic = pred_quad
) %>%
pivot_longer(cols = c(linear, quadratic),
names_to = "Model", values_to = "Predicted_BMI")
ggplot() +
geom_point(data = brfss_slr, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_line(data = pred_df, aes(x = age, y = Predicted_BMI, color = Model),
linewidth = 1.3) +
scale_color_manual(
values = c("linear" = "red", "quadratic" = "darkorange"),
labels = c("linear" = "Linear fit", "quadratic" = "Quadratic fit (Age + Age²)")
) +
labs(
title = "BMI vs. Age: Linear vs. Quadratic Model",
subtitle = "Does BMI rise then fall with age, or decline monotonically?",
x = "Age (years)",
y = "BMI (kg/m²)",
color = "Model"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")Linear vs. Quadratic Fit: BMI ~ Age
Interpretation: If the coefficient on Age² is negative and statistically significant, the fitted curve is an inverted-U — BMI peaks at some middle age and declines thereafter. Extract the peak using \(\text{Age}^* = -b_1 / (2 b_2)\). A positive Age² coefficient would indicate a U-shape (BMI lowest in middle age).
b1_q <- coef(model_quad)["age"]
b2_q <- coef(model_quad)["age2"]
peak_age <- -b1_q / (2 * b2_q)
tibble(
Quantity = c("b₁ (Age)", "b₂ (Age²)", "Peak / Trough Age (-b₁ / 2b₂)"),
Value = c(round(b1_q, 5), round(b2_q, 6), round(peak_age, 1))
) %>%
kable(caption = "Quadratic Model Coefficients and Implied Turning Point") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| b₁ (Age) | 0.474350 |
| b₂ (Age²) | -0.004635 |
| Peak / Trough Age (-b₁ / 2b₂) | 51.200000 |
Caution on interpretation: Even if the quadratic model fits better statistically, be cautious about causal interpretation. The cross-sectional pattern reflects cohort differences in BMI trajectories, not necessarily the aging process within any individual. Survivorship bias (heavier individuals dying earlier) can make the quadratic term appear significant in cross-sectional data.
| Quantity | Formula |
|---|---|
| Slope | \(b_1 = S_{XY} / S_{XX}\) |
| Intercept | \(b_0 = \bar{Y} - b_1 \bar{X}\) |
| SSTotal | \(\sum(Y_i - \bar{Y})^2\) |
| SSRegression | \(\sum(\hat{Y}_i - \bar{Y})^2\) |
| SSResidual | \(\sum(Y_i - \hat{Y}_i)^2\) |
| MSE | \(SS_{Residual} / (n-2)\) |
| \(R^2\) | \(SS_{Reg} / SS_{Total}\) |
| \(SE(b_1)\) | \(\hat{\sigma}/\sqrt{S_{XX}}\) |
| t-statistic | \(b_1 / SE(b_1)\) |
| 95% CI for \(\beta_1\) | \(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\) |
In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
# Load raw BRFSS 2020 data
brfss_full <- read_xpt("C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/LLCP2020XPT/LLCP2020.XPT") %>%
janitor::clean_names()
# Select variables of interest
brfss_slr <- brfss_full %>%
select(bmi5, age80, sex, physhlth, sleptim1)
# Recode variables
brfss_slr <- brfss_slr %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_slr <- brfss_slr %>%
select(bmi, age, sex, phys_days, sleep_hrs) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
# Save analytic dataset
saveRDS(brfss_slr, here::here(
"C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Simple Linear Regression/LLCP2020XPT_clean.rds"
))# (a) Create a summary table of phys_days and bmi
brfss_slr %>%
select(bmi, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| bmi | phys_days | |
|---|---|---|
| Min. :14.63 | Min. : 1.00 | |
| 1st Qu.:24.32 | 1st Qu.: 2.00 | |
| Median :27.89 | Median : 6.00 | |
| Mean :29.18 | Mean :11.66 | |
| 3rd Qu.:32.89 | 3rd Qu.:20.00 | |
| Max. :59.60 | Max. :30.00 |
#Table
brfss_slr %>%
select(bmi, phys_days) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m2)",
phys_days ~ "Physical Days (days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")| Characteristic | N | N = 3,0001 |
|---|---|---|
| BMI (kg/m2) | 3,000 | 29.2 (7.0) |
| Physical Days (days) | 3,000 | 11.7 (11.2) |
| 1 Mean (SD) | ||
# (b) Create a histogram of phys_days — describe the distribution
# (b) Histogram of phys_days
ggplot(brfss_slr, aes(x = phys_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Poor Physical Health Days (Past 30 Days)",
x = "Number of Poor Physical Health Days",
y = "Count"
) +
theme_minimal()# Distribution description:
# The distribution of phys_days is strongly right-skewed. A large proportion
# of respondents reported 0 poor physical health days, creating a sharp spike
# at zero. From there, counts drop off quickly, with a smaller secondary
# concentration near 30 days (the maximum). This bimodal, zero-inflated
# pattern is common in self-reported health day variables.
# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
# (c) Scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_slr, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Poor Physical Health Days vs. BMI",
x = "BMI",
y = "Number of Poor Physical Health Days"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?Answer
Mean_phys_days: 11.7 sd_phys_days: 11.2 mean_bmi: 29.2 sd_bmi: 7
The distribution of physical days is strongly right-skwed A large proportion of respondents reported 0 poor physical health days, creating a sharp spike at zero. From there, counts drop off quickly, with a maximum concentration around day 30.
Answer
Interpretation of the Scatter Plot Overall pattern: There is a modest positive association between BMI and poor physical health days, as BMI increases, the number of poor physical health days tends to increase slightly.
Linear fit (red solid line): The lm smoother shows a gentle upward slope with a fairly narrow confidence band, suggesting the linear relationship is statistically detectable but not strong.
LOESS fit (blue dashed line): This is the most informative feature of the plot. The LOESS curve shows a non-linear pattern, it starts high at very low BMI values (around 15–20), dips down, then rises again with increasing BMI. This suggests the linear model may not fully capture the true relationship, particularly at the extremes of BMI.
Data structure: The points form horizontal bands at integer values (0, 5, 10, 15, 20, 30), which reflects the discrete, self-reported nature of phys_days. The dense band at y = 30 and y = 0 is consistent with what was observed in the histogram.
Takeaway: While a positive linear trend exists, the LOESS curve suggests some non-linearity, and the wide spread of points indicates high residual variability, meaning BMI alone explains relatively little of the variation in poor physical health days.
# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr)
# Summary output
summary(model_slr)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Physical Health Days ~ BMI (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
# (c) Extract and report: slope, intercept, t-statistic, p-value
slope_test <- tidy(model_slr, conf.int = TRUE)
intercept_row <- slope_test %>% filter(term == "(Intercept)")
slope_row <- slope_test %>% filter(term == "bmi")
b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)
b0## (Intercept)
## 7.423
## bmi
## 0.1451
tibble(
Quantity = c("Intercept (β₀)", "Slope (β₁)", "SE(b₁)",
"t-statistic", "Degrees of freedom",
"p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(intercept_row$estimate, 4),
round(slope_row$estimate, 4),
round(slope_row$std.error, 4),
round(slope_row$statistic, 3),
nrow(brfss_slr) - 2,
format.pval(slope_row$p.value, digits = 3),
round(slope_row$conf.low, 4),
round(slope_row$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Quantity | Value |
|---|---|
| Intercept (β₀) | 7.4228 |
| Slope (β₁) | 0.1451 |
| SE(b₁) | 0.0289 |
| t-statistic | 5.013 |
| Degrees of freedom | 2998 |
| p-value | 5.66e-07 |
| 95% CI Lower | 0.0884 |
| 95% CI Upper | 0.2019 |
Questions:
Fitted regression equation:
\[\widehat{\text{Phys_Days}} = 7.423 + 0.1451 \times \text{BMI}\]
For each 1-year increase in BMI, poor physical health days is estimated to increase by 0.1451 days, on average. So for example, someone with a BMI of 35 would be expected to report about 1.45 more poor physical health days per month than someone with a BMI of 25, a 10-unit BMI difference.
The intercept (\(b_0\) = 7.4228) not meaningfully interpretable in this context. It represents the estimated mean number of poor physical health days when BMI = 0, which is physiologically impossible, no living adult has a BMI of zero. The intercept serves purely as a mathematical anchor for the regression line and should not be interpreted in any practical or clinical sense.
Null hypothesis: H0:β1=0 H_0: _1 = 0 H0:β1=0 —> BMI has no linear association with the number of poor physical health days among U.S. adults.
Alternative hypothesis: HA:β1≠0 H_A: _1 HA:β1=0 —> BMI has a linear association with the number of poor physical health days.
Test statistic: t(2998)=5.0134t_{(2998)} = 5.0134 t(2998)=5.0134 p-value: p<0.001p < 0.001 p<0.001 Decision: We reject H0H_0 H0 at the α=0.05= 0.05 α=0.05 significance level.
Conclusion: There is statistically significant evidence at the α=0.05= 0.05 α=0.05 level that BMI is positively associated with poor physical health days (t(2998)=5.013t_{(2998)} = 5.013 t(2998)=5.013, p<0.001p < 0.001 p<0.001). The 95% confidence interval for the slope (0.0884, 0.2019)(0.0884, 0.2019) (0.0884, 0.2019) does not contain zero, which is consistent with this conclusion. In other words, the probability of observing a t-statistic as extreme as 5.013 by chance alone — assuming H0H_0 H0 is true and there is truly no association — is less than 0.001, which is far below our threshold of 0.05. Therefore, we have strong statistical evidence to conclude that higher BMI is associated with more days of poor physical health. —
# (a) Display the ANOVA table
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: Poor Physical Health Days ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
augmented_slr <- augment(model_slr)
ss_reg <- sum((augmented_slr$.fitted - mean(brfss_slr$phys_days))^2)
ss_resid <- sum(augmented_slr$.resid^2)
ss_total <- ss_reg + ss_resid
n_slr <- nrow(brfss_slr)
tibble(
Source = c("SS Regression", "SS Residual", "SS Total"),
SS = c(round(ss_reg, 2), round(ss_resid, 2), round(ss_total, 2)),
df = c(1, n_slr - 2, n_slr - 1),
MS = c(round(ss_reg / 1, 2), round(ss_resid / (n_slr - 2), 2), NA),
F_stat = c(round((ss_reg / 1) / (ss_resid / (n_slr - 2)), 3), NA, NA)
) %>%
kable(
caption = "ANOVA Decomposition: SSTotal, SSRegression, SSResidual",
col.names = c("Source", "Sum of Squares", "df", "Mean Square", "F-statistic")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Sum of Squares | df | Mean Square | F-statistic |
|---|---|---|---|---|
| SS Regression | 3105.36 | 1 | 3105.36 | 25.134 |
| SS Residual | 370411.74 | 2998 | 123.55 | NA |
| SS Total | 373517.11 | 2999 | NA | NA |
# (c) Compute R² two ways: from the model object and from the SS decomposition
r2_model <- summary(model_slr)$r.squared
adj_r2 <- summary(model_slr)$adj.r.squared
r2_ss <- ss_reg / ss_total
tibble(
Method = c("From model object (summary())",
"From SS decomposition (SSReg / SSTotal)",
"Adjusted R²"),
R_squared = c(round(r2_model, 4),
round(r2_ss, 4),
round(adj_r2, 4)),
Variance_Explained = c(paste0(round(r2_model * 100, 2), "%"),
paste0(round(r2_ss * 100, 2), "%"),
paste0(round(adj_r2 * 100, 2), "%"))
) %>%
kable(
caption = "R² Computed Two Ways",
col.names = c("Method", "R²", "Variance Explained")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | R² | Variance Explained |
|---|---|---|
| From model object (summary()) | 0.0083 | 0.83% |
| From SS decomposition (SSReg / SSTotal) | 0.0083 | 0.83% |
| Adjusted R² | 0.0080 | 0.8% |
Questions:
Answer
\(SS_{Total}\) = 373517.11 \(df\)_Total = 2,998
\(SS_{Regression}\) = 3105.36 with \(df\)_Regression = 1
\(SS_{Residual}\) = 87,443.21 with \(df\)_Residual = 2,999
\(F\)-statistic = 25.134
What is the \(R^2\) value? Interpret it in plain English. R2 = 0.0083, meaning approximately 0.83% of the total variation in poor physical health days is explained by BMI alone.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?
An \(R^2\) of 0.0083 indicates that BMI alone is a very weak predictor of poor physical health days. The remaining 99.17% of variation is unexplained and likely attributable to other factors such as:
Age: older adults may experience more physical health limitations Sex: biological differences may influence physical health outcomes Chronic disease status: conditions such as diabetes or arthritis directly increase poor health days Mental health: depression and anxiety are strongly linked to physical health perception Socioeconomic status: income and access to healthcare influence health outcomes Health behaviors: smoking, physical activity, and diet all affect physical health
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = 25)
ci_bmi25 <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
ci_bmi25 %>%
kable(
caption = "Fitted Value and 95% Confidence Interval for BMI = 25",
col.names = c("Fitted Value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitted Value | 95% CI Lower | 95% CI Upper |
|---|---|---|
| 11.051 | 10.588 | 11.514 |
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_bmi25 <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
as.data.frame() %>%
rename(Fitted = fit, PI_Lower = lwr, PI_Upper = upr) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
pi_bmi25 %>%
kable(
caption = "Fitted Value and 95% Prediction Interval for BMI = 25",
col.names = c("Fitted Value", "95% PI Lower", "95% PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Fitted Value | 95% PI Lower | 95% PI Upper |
|---|---|---|
| 11.051 | -10.749 | 32.851 |
# Combined table: CI vs PI side by side
bind_cols(
data.frame(BMI = 25),
ci_bmi25,
pi_bmi25 %>% select(PI_Lower, PI_Upper)
) %>%
kable(
caption = "Confidence Interval vs. Prediction Interval at BMI = 25",
col.names = c("BMI", "Fitted", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| BMI | Fitted | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 11.051 | 10.588 | 11.514 | -10.749 | 32.851 |
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(14.5, 60, length.out = 200))
ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
ggplot() +
geom_point(data = brfss_slr, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Poor Physical Health Days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual",
x = "BMI",
y = "Number of Poor Physical Health Days",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)Questions:
For someone with a BMI of 25, the estimated mean number of poor physical health days is 11.051 days, with a 95% confidence interval of (10.588, 11.514).
For a specific new person with a BMI of 25, the 95% prediction interval is (-10.749, 32.851). The lower bound is negative which is not meaningful for this outcome since days cannot be negative, this highlights a limitation of applying linear regression to a bounded, skewed outcome variable like phys_days.
The prediction interval (−10.749, 32.851) is much wider than the confidence interval (10.588, 11.514) because:
The CI estimates uncertainty around the population mean number of poor physical health days for all people with BMI = 25 — it only accounts for uncertainty in estimating the mean The PI must account for two sources of uncertainty — the uncertainty in estimating the mean plus the natural individual-level variability around that mean
In practice:
Use the CI when you want to estimate the average poor physical health days for a group of people with BMI = 25 (e.g., a public health researcher describing population-level estimates) Use the PI when you want to predict the poor physical health days for a single new individual with BMI = 25 (e.g., a clinician trying to predict an individual patient’s outcome) —
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)par(mfrow = c(1, 1))
# (b) Create a residuals vs. fitted plot using ggplot
p_resid_fitted <- ggplot(augmented_slr, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Fitted Values",
subtitle = "Should show no pattern — random scatter around zero",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_fitted# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented_slr, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq# (d) Create a Cook's distance plot
augmented_slr <- augmented_slr %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n_slr,
"Potentially influential", "Not influential")
)
n_influential <- sum(augmented_slr$cooks_d > 4 / n_slr)
p_cooks <- ggplot(augmented_slr, aes(x = obs_num, y = cooks_d,
color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n_slr, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksQuestions:
The Residuals vs. Fitted plot shows a non-random pattern, the residuals are not evenly scattered around zero. There is clear evidence of heteroscedasticity, with the spread of residuals increasing at higher fitted values. The LOESS smoother also departs from zero, suggesting mild nonlinearity.
phys_days?The Q-Q plot shows substantial departures from normality — the points deviate sharply from the red reference line at both tails, particularly the upper tail. This suggests the residuals are right-skewed, which is a direct reflection of the skewed, zero-inflated distribution of phys_days observed in Task 1.
Yes, there are 136 potentially influential observations with Cook’s D > 4/n threshold (dashed red line), as shown in the Cook’s Distance plot. These observations are scattered across the entire observation range (0 to 3,000) with no particular clustering, and their Cook’s D values range up to approximately 0.005, which are relatively small in absolute magnitude.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_slr)
# Summary output
summary(model_age)##
## Call:
## lm(formula = phys_days ~ age, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.872 -8.803 -4.733 9.460 23.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.37023 0.66608 6.561 6.27e-11 ***
## age 0.13127 0.01145 11.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared: 0.04202, Adjusted R-squared: 0.0417
## F-statistic: 131.5 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
tidy(model_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
# Comparison table: BMI model vs Age model
r2_bmi <- summary(model_slr)$r.squared
r2_age <- summary(model_age)$r.squared
slope_bmi <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "bmi")
slope_age <- tidy(model_age, conf.int = TRUE) %>% filter(term == "age")
tibble(
Metric = c("Predictor", "Slope (b₁)", "SE(b₁)",
"t-statistic", "p-value",
"95% CI Lower", "95% CI Upper", "R²"),
BMI_Model = c("BMI",
round(slope_bmi$estimate, 4),
round(slope_bmi$std.error, 4),
round(slope_bmi$statistic, 3),
format.pval(slope_bmi$p.value, digits = 3),
round(slope_bmi$conf.low, 4),
round(slope_bmi$conf.high, 4),
round(r2_bmi, 4)),
Age_Model = c("Age",
round(slope_age$estimate, 4),
round(slope_age$std.error, 4),
round(slope_age$statistic, 3),
format.pval(slope_age$p.value, digits = 3),
round(slope_age$conf.low, 4),
round(slope_age$conf.high, 4),
round(r2_age, 4))
) %>%
kable(
caption = "Model Comparison: phys_days ~ BMI vs. phys_days ~ Age",
col.names = c("Metric", "BMI Model", "Age Model")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Metric | BMI Model | Age Model |
|---|---|---|
| Predictor | BMI | Age |
| Slope (b₁) | 0.1451 | 0.1313 |
| SE(b₁) | 0.0289 | 0.0114 |
| t-statistic | 5.013 | 11.467 |
| p-value | 5.66e-07 | <2e-16 |
| 95% CI Lower | 0.0884 | 0.1088 |
| 95% CI Upper | 0.2019 | 0.1537 |
| R² | 0.0083 | 0.042 |
# (c) Which predictor has the stronger association? Compare R² values.
tibble(
Model = c("phys_days ~ BMI", "phys_days ~ Age"),
R_squared = c(round(r2_bmi, 4), round(r2_age, 4)),
Variance_Explained = c(paste0(round(r2_bmi * 100, 2), "%"),
paste0(round(r2_age * 100, 2), "%"))
) %>%
kable(
caption = "R² Comparison: BMI Model vs. Age Model",
col.names = c("Model", "R²", "Variance Explained")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.max(c(r2_bmi, r2_age)), bold = TRUE, background = "#d4edda")| Model | R² | Variance Explained |
|---|---|---|
| phys_days ~ BMI | 0.0083 | 0.83% |
| phys_days ~ Age | 0.0420 | 4.2% |
Questions:
Both BMI and age show a positive association with poor physical health days, higher BMI and older age are both associated with more poor physical health days. However the associations differ in magnitude and significance based on the model results above
phys_days?R2 for BMI model = 0.0083, explaining 0.83% of variation in phys_days
R2 for Age model = 0.0420, explaining 4.2% of variation in
phys_days
The (BMI or Age) model explains more variability in poor physical health days based on the higher R2R^2 R2 value
Based on these two simple models, both BMI and age are statistically significant but weak individual predictors of poor physical health days. Age performs modestly better than BMI (R2R^2 R2 = 0.0420 vs. 0.0083), yet even the age model leaves 95.80% of the variation in phys_days unexplained.
The limitations of using simple linear regression for this outcome include:
Violated assumptions: phys_days is bounded (0–30), zero-inflated, and right-skewed, which violates the normality and equal variance assumptions of SLR
Single predictor: restricting to one predictor ignores important confounders such as chronic disease status, sex, socioeconomic status, and health behaviors
Negative predictions: the prediction interval produced negative values, which are impossible for this outcome
Nonlinearity: the LOESS smoother in Task 1 suggested the relationship may not be strictly linear across the full BMI range
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
Due: Before the next class session.
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 ggeffects_2.3.2 broom_1.0.11 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 here_1.0.2 haven_2.5.5
## [9] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [13] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [17] ggplot2_4.0.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.9.0 htmlwidgets_1.6.4
## [5] insight_1.4.4 lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2
## [9] vctrs_0.7.1 tools_4.5.2 generics_0.1.4 pkgconfig_2.0.3
## [13] Matrix_1.7-4 data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.0
## [17] gt_1.3.0 lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [21] textshaping_1.0.4 janitor_2.2.1 snakecase_0.11.1 litedown_0.9
## [25] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2
## [29] pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0 nlme_3.1-168
## [33] commonmark_2.0.0 tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
## [37] labeling_0.4.3 splines_4.5.2 rprojroot_2.1.1 fastmap_1.2.0
## [41] grid_4.5.2 cli_3.6.5 magrittr_2.0.4 cards_0.7.1
## [45] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [49] rmarkdown_2.30 httr_1.4.7 otel_0.2.0 hms_1.1.4
## [53] evaluate_1.0.5 viridisLite_0.4.2 mgcv_1.9-3 markdown_2.0
## [57] rlang_1.1.7 glue_1.8.0 xml2_1.5.2 svglite_2.2.2
## [61] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1
## [65] fs_1.6.6