Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:
In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.
Summary of Key Formulas
| Quantity | Formula |
|---|---|
| Slope | \(b_1 = S_{XY} / S_{XX}\) |
| Intercept | \(b_0 = \bar{Y} - b_1 \bar{X}\) |
| SSTotal | \(\sum(Y_i - \bar{Y})^2\) |
| SSRegression | \(\sum(\hat{Y}_i - \bar{Y})^2\) |
| SSResidual | \(\sum(Y_i - \hat{Y}_i)^2\) |
| MSE | \(SS_{Residual} / (n-2)\) |
| \(R^2\) | \(SS_{Reg} / SS_{Total}\) |
| \(SE(b_1)\) | \(\hat{\sigma}/\sqrt{S_{XX}}\) |
| t-statistic | \(b_1 / SE(b_1)\) |
| 95% CI for \(\beta_1\) | \(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\) |
In this lab, you will apply Simple Linear Regression to the
BRFSS 2020 dataset using a different outcome variable:
number of days of poor physical health in the past 30
days (phys_days). You will model it as a
continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(plotly)
library(ggeffects)
# Load the RDS file
brfss_lab <- readRDS("brfss_slr_2020.rds")
# Check that it worked
glimpse(brfss_lab)
summary(brfss_lab)
# Select variables of interest
brfss_lab <- brfss_full %>%
select(bmi5, age80, sex, physhlth, sleptim1)
# Recode variables
brfss_lab <- brfss_lab %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
select(bmi, age, sex, phys_days, sleep_hrs) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)# (a) Create a summary table of phys_days and bmi
# Load the data
brfss_lab <- readRDS("brfss_slr_2020.rds")
# Load packages in this chunk
library(tidyverse)
library(knitr)
library(kableExtra)
brfss_lab %>%
select(phys_days, bmi) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| phys_days | bmi | |
|---|---|---|
| Min. : 1.00 | Min. :14.63 | |
| 1st Qu.: 2.00 | 1st Qu.:24.32 | |
| Median : 6.00 | Median :27.89 | |
| Mean :11.66 | Mean :29.18 | |
| 3rd Qu.:20.00 | 3rd Qu.:32.89 | |
| Max. :30.00 | Max. :59.60 |
brfss_lab %>%
select(phys_days, bmi) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| phys_days | bmi | |
|---|---|---|
| Min. : 1.00 | Min. :14.63 | |
| 1st Qu.: 2.00 | 1st Qu.:24.32 | |
| Median : 6.00 | Median :27.89 | |
| Mean :11.66 | Mean :29.18 | |
| 3rd Qu.:20.00 | 3rd Qu.:32.89 | |
| Max. :30.00 | Max. :59.60 |
# (b) Create a histogram of phys_days — describe the distribution
p_hist_phys <- ggplot(brfss_lab, aes(x = phys_days)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "red", linewidth = 1) +
labs(
title = "Distribution of Poor Physical Health Days",
subtitle = "Days in past 30 days with poor physical health",
x = "Number of Days",
y = "Density"
) +
theme_minimal(base_size = 13)
# Display the plot
p_hist_phys# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter_lab <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "orange", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Poor Physical Health Days vs. BMI",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days (past 30)"
) +
theme_minimal(base_size = 13)
# Display the plot
p_scatter_lab# Descriptive Statistics
brfss_lab %>%
select(phys_days, bmi) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Poor Physical Health Days and BMI") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| phys_days | bmi | |
|---|---|---|
| Min. : 1.00 | Min. :14.63 | |
| 1st Qu.: 2.00 | 1st Qu.:24.32 | |
| Median : 6.00 | Median :27.89 | |
| Mean :11.66 | Mean :29.18 | |
| 3rd Qu.:20.00 | 3rd Qu.:32.89 | |
| Max. :30.00 | Max. :59.60 |
# Calculate actual standard deviation
sd_phys <- sd(brfss_lab$phys_days, na.rm = TRUE)
sd_bmi <- sd(brfss_lab$bmi, na.rm = TRUE)
cat("Standard deviation for phys_days:", round(sd_phys, 2), "\n")## Standard deviation for phys_days: 11.16
## Standard deviation for BMI: 7.01
# Create a more complete table
brfss_lab %>%
select(phys_days, bmi) %>%
summarise(
Variable = c("phys_days", "bmi"),
Mean = c(mean(phys_days), mean(bmi)),
SD = c(sd(phys_days), sd(bmi)),
Median = c(median(phys_days), median(bmi)),
Min = c(min(phys_days), min(bmi)),
Max = c(max(phys_days), max(bmi))
) %>%
kable(caption = "Complete Descriptive Statistics", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|
| phys_days | 11.66 | 11.16 | 6.00 | 1.00 | 30.0 |
| bmi | 29.18 | 7.01 | 27.89 | 14.63 | 59.6 |
Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?Descriptive Statistics:
Poor Physical Health Days (phys_days): Mean = 12.03 days, Standard Deviation = 11.21 days
Body Mass Index (bmi): Mean = 29.47 kg/m², Standard Deviation = 6.87 kg/m²
The relationship appears approximately linear – the LOESS smoother closely follows the linear fit.
Obvious outliers include:
Individuals with high BMI (>35) reporting few poor health days (1-5 days)
Individuals with normal BMI (20-25) reporting 30 poor health days
A ceiling effect at 30 days across all BMI values
# Load ALL required packages
library(tidyverse) # For %>% pipe, ggplot, dplyr
library(knitr) # For kable tables
library(kableExtra) # For table styling
library(broom) # For tidy() function
# (a) Fit the SLR model: phys_days ~ bmi
model_phys_bmi <- lm(phys_days ~ bmi, data = brfss_lab)
# View summary output
summary(model_phys_bmi)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_phys_bmi, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 2: Simple Linear Regression - Poor Physical Health Days ~ BMI",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
# Extract coefficients
b0 <- round(coef(model_phys_bmi)[1], 3) # Intercept
b1 <- round(coef(model_phys_bmi)[2], 4) # Slope
# Extract test statistics
t_stat <- round(summary(model_phys_bmi)$coefficients[2, 3], 3)
p_val <- summary(model_phys_bmi)$coefficients[2, 4]
# Create summary table
tibble(
Quantity = c("Intercept (b₀)", "Slope (b₁)", "t-statistic", "p-value"),
Value = c(b0, b1, t_stat, format.pval(p_val, digits = 4))
) %>%
kable(caption = "Table 3: Key Model Statistics") %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)| Quantity | Value |
|---|---|
| Intercept (b₀) | 7.423 |
| Slope (b₁) | 0.1451 |
| t-statistic | 5.013 |
| p-value | 5.659e-07 |
Questions:
\(\widehat{\text{phys_days}} = 7.519 + 0.1531 \times \text{bmi}\)
Each 1-unit increase in BMI is associated with an average increase of 0.1531 poor physical health days (about 4.6 hours).
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? No. \(b_0 = 7.519\) represents predicted poor health days when BMI = 0, which is impossible.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.
\(H_0: \beta_1 = 0\) (no linear relationship)
\(H_A: \beta_1 \neq 0\) (linear relationship exists)
Test statistic: \(t = 5.158\)
p-value = \(2.65 \times 10^{-7}\)
Since p-value < 0.05, reject \(H_0\). There is significant evidence of a positive linear association between BMI and poor physical health days.
# (a) Display the ANOVA table
anova_phys <- anova(model_phys_bmi)
anova_phys %>%
kable(caption = "Table 4: ANOVA Table - Poor Physical Health Days ~ BMI",
digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
ss_reg <- anova_phys$`Sum Sq`[1]
ss_res <- anova_phys$`Sum Sq`[2]
ss_total <- ss_reg + ss_res
# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_phys_bmi)$r.squared
r_sq_calc <- ss_reg / ss_total
ss_total <- ss_reg + ss_res
# Display results
tibble(
Source = c("Regression", "Residual", "Total"),
`Sum Sq` = c(round(ss_reg, 2), round(ss_res, 2), round(ss_total, 2)),
df = c(1, 2998, 2999),
`Mean Sq` = c(round(ss_reg/1, 2), round(ss_res/2998, 2), NA)
) %>%
kable(caption = "Table 5: Sums of Squares Decomposition") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Source | Sum Sq | df | Mean Sq |
|---|---|---|---|
| Regression | 3105.36 | 1 | 3105.36 |
| Residual | 370411.74 | 2998 | 123.55 |
| Total | 373517.11 | 2999 | NA |
##
## R² = 0.0083 ( 0.83 %)
## R² (calculated from SS) = 0.0083
## ========================================
## R² RESULTS
## ========================================
## R² = 0.0083 ( 0.83 %)
## R² (calculated from SS) = 0.0083
## SS Total: 373517.1
## SS Regression: 3105.36
## SS Residual: 370411.7
## F-statistic: 25.134
Questions:
\(SS_{Total} = 376,802.40\)
\(SS_{Regression} = 3,314.97\)
\(SS_{Residual} = 373,487.40\)
\(df_{Regression} = 1\)
\(df_{Residual} = 2,998\)
\(df_{Total} = 2,999\)
\(F\)-statistic = 26.609
R² = 0.0088 (0.88%)
Only 0.88% of the variability in poor physical health days is explained by BMI alone. This means that BMI accounts for less than 1% of the differences we observe in how many poor physical health days people report.
This very low R² tells us that BMI alone is a poor predictor of poor physical health days. While BMI is statistically significant (p < 0.001 from the F-test), it explains virtually none of the real-world variation in this health outcome.
The other 99.12% is explained by factors not in the model: chronic conditions, mental health, health behaviors, healthcare access, socioeconomic factors, demographics, and genetics.
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = 25)
ci_25 <- predict(model_phys_bmi, newdata = new_bmi, interval = "confidence", level = 0.95) %>%
as.data.frame() %>%
mutate(BMI = 25, Type = "Confidence Interval")
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_25 <- predict(model_phys_bmi, newdata = new_bmi, interval = "prediction", level = 0.95) %>%
as.data.frame() %>%
mutate(BMI = 25, Type = "Prediction Interval") %>%
select(-fit)
# Combine results
results_25 <- bind_cols(
BMI = 25,
Fitted = round(ci_25$fit, 2),
CI_Lower = round(ci_25$lwr, 2),
CI_Upper = round(ci_25$upr, 2),
PI_Lower = round(pi_25$lwr, 2),
PI_Upper = round(pi_25$upr, 2)
)
# Display table
results_25 %>%
kable(
caption = "Table 6: 95% Confidence and Prediction Intervals for BMI = 25",
col.names = c("BMI", "Fitted Value", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| BMI | Fitted Value | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 11.05 | 10.59 | 11.51 | -10.75 | 32.85 |
# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(15, 55, length.out = 200))
ci_band <- predict(model_phys_bmi, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_phys_bmi, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
alpha = 0.15, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Figure 3: Poor Physical Health Days ~ BMI",
subtitle = "Dark band = 95% CI for mean | Light band = 95% PI for individual",
x = "BMI (kg/m²)",
y = "Poor Physical Health Days (past 30)"
) +
theme_minimal(base_size = 13)
p_ci_piQuestions:
For all adults with BMI = 25, the estimated average number of poor physical health days is 11.35 days. I am 95% confident that the true population mean falls between 10.87 and 11.82 days.
For a specific new individual with BMI = 25, it is 95% confident that their actual number of poor physical health days will fall between -10.54 and 33.24 days.
The prediction interval is wider than the confidence interval because it accounts for two sources of uncertainty:
Confidence interval only accounts for uncertainty in estimating the population mean
Prediction interval adds the individual variability (error term ε) around that mean
When to use each in practice:
Use a confidence interval when you want to estimate the average outcome for a group (e.g., “What’s the average poor health days for all 45-year-olds with BMI 25?”)
Use a prediction interval when you want to predict a single individual’s outcome (e.g., “If a specific patient has BMI 25, how many poor health days might they experience?”)
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_phys_bmi, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)par(mfrow = c(1, 1))
# (b) Create a residuals vs. fitted plot using ggplot
# Augment dataset with residuals
augmented <- augment(model_phys_bmi)
p_resid_fitted <- ggplot(augmented, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Figure 4: Residuals vs. Fitted Values",
subtitle = "Should show random scatter around zero",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_fitted# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Figure 5: Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on red line if residuals are normal",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq# (d) Create a Cook's distance plot
n <- nrow(brfss_lab)
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_phys_bmi),
influential = ifelse(cooks_d > 4/n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4/n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4/n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Figure 6: Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksQuestions:
Yes – there is a clear fan-shaped pattern (increasing spread) indicating heteroscedasticity, and the residuals are not randomly scattered around zero, suggesting nonlinearity.
phys_days?No, residuals are not normal. The S-shaped deviation from the red line indicates heavy tails and skewness, reflecting the bounded count nature of phys_days (1-30 days).
Yes – 133 observations have Cook’s D > 4/n. These should be examined for data errors and a sensitivity analysis conducted, but not automatically deleted.
Assumption Met
Linearity ❌ No Independence ✅ Yes Normality ❌ No Equal variance ❌ No
Most problematic: Normality and equal variance, due to phys_days being a bounded count outcome (1-30 days) with a ceiling effect. Linear regression is misspecified here; Poisson or negative binomial models would be more appropriate.
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_phys_age <- lm(phys_days ~ age, data = brfss_lab)
# (b) Display results and compare to the BMI model
tidy(model_phys_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 7: SLR - Poor Physical Health Days ~ Age",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
# (c) Which predictor has the stronger association? Compare R² values.
r2_bmi <- summary(model_phys_bmi)$r.squared
r2_age <- summary(model_phys_age)$r.squared
# Display comparison
tibble(
Model = c("BMI as Predictor", "Age as Predictor"),
R_squared = c(round(r2_bmi, 4), round(r2_age, 4)),
Percentage = c(paste0(round(r2_bmi * 100, 2), "%"),
paste0(round(r2_age * 100, 2), "%")),
Slope = c(round(coef(model_phys_bmi)[2], 4),
round(coef(model_phys_age)[2], 4)),
p_value = c(format.pval(summary(model_phys_bmi)$coefficients[2, 4], digits = 3),
format.pval(summary(model_phys_age)$coefficients[2, 4], digits = 3))
) %>%
kable(caption = "Table 8: Model Comparison - BMI vs. Age") %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)| Model | R_squared | Percentage | Slope | p_value |
|---|---|---|---|---|
| BMI as Predictor | 0.0083 | 0.83% | 0.1451 | 5.66e-07 |
| Age as Predictor | 0.0420 | 4.2% | 0.1313 | <2e-16 |
# Scatter plot for age
p_age <- ggplot(brfss_lab, aes(x = age, y = phys_days)) +
geom_jitter(alpha = 0.15, color = "darkgreen", width = 0.5, height = 0) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
labs(
title = "Figure 7: Poor Physical Health Days vs. Age",
x = "Age (years)",
y = "Poor Physical Health Days (past 30)"
) +
theme_minimal(base_size = 13)
p_ageQuestions:
Both BMI and age show positive, statistically significant associations (p < 0.001). BMI has a slightly larger slope (0.1531 vs. 0.1377), meaning a slightly stronger effect per unit change.
phys_days?BMI model: 0.0088 (0.88%)
Age model: 0.0443 (4.43%)
Age explains more variability in poor physical health days than BMI (4.43% vs. 0.88%).
Conclusion: Both are significant predictors, but age is stronger. However, neither explains much variation (< 5%), showing poor physical health is multifactorial.
Limitations:
Non-normal outcome (bounded count 1-30)
Heteroscedasticity and ceiling effects at 30 days
No confounders – simple models omit key variables
Linear regression is misspecified for count outcomes; Poisson or negative binomial models would be better.
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_1.0.12 kableExtra_1.4.0 knitr_1.51 lubridate_1.9.3
## [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.2
## [13] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 xml2_1.3.6 lattice_0.22-6
## [5] stringi_1.8.4 hms_1.1.4 digest_0.6.37 magrittr_2.0.3
## [9] evaluate_1.0.5 grid_4.4.2 timechange_0.3.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 Matrix_1.7-1 jsonlite_2.0.0 backports_1.5.0
## [17] mgcv_1.9-1 viridisLite_0.4.3 scales_1.4.0 textshaping_0.4.0
## [21] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4 splines_4.4.2
## [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 otel_0.2.0
## [29] tools_4.4.2 tzdb_0.4.0 vctrs_0.6.5 R6_2.6.1
## [33] lifecycle_1.0.5 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0
## [37] gtable_0.3.6 glue_1.8.0 systemfonts_1.3.1 xfun_0.56
## [41] tidyselect_1.2.1 rstudioapi_0.18.0 farver_2.1.2 nlme_3.1-166
## [45] htmltools_0.5.8.1 rmarkdown_2.30 svglite_2.2.2 labeling_0.4.3
## [49] compiler_4.4.2 S7_0.2.1