This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the plot. #
Setup and Data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
library(here)
## Warning: package 'here' was built under R version 4.5.2
## here() starts at C:/Users/safwa/OneDrive - University at Albany - SUNY/EPI 553/Labs
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(broom)
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
# Load raw BRFSS 2020 data
brfss_full <- read_rds("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_slr_2020.rds") %>%
janitor::clean_names()
# Display dataset dimensions
names(brfss_full)
## [1] "bmi" "age" "sex" "education"
## [5] "gen_health_num" "sleep_hrs" "phys_days"
# (a) Create a summary table of phys_days and bmi
summary(brfss_full[, c("phys_days", "bmi")])
## phys_days bmi
## Min. : 1.00 Min. :14.63
## 1st Qu.: 2.00 1st Qu.:24.32
## Median : 6.00 Median :27.89
## Mean :11.66 Mean :29.18
## 3rd Qu.:20.00 3rd Qu.:32.89
## Max. :30.00 Max. :59.60
# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_full, aes(x = phys_days)) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) +
labs(title = "Histogram of Physical Days", x = "Physical Days", y = "Frequency")
# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_full, aes(x = bmi, y = phys_days)) +
geom_point(color = "steelblue", alpha = 0.6) +
labs(title = "Physical Days vs BMI", x = "BMI", y = "Physical Days")
Questions:
What is the mean and standard deviation of
phys_days? Of bmi? What do you notice about
the distribution of phys_days? Ans: Mean phys_days = 11.66,
SD = 11.16. Mean bmi = 29.18, SD = 7.01.phys_days is strongly
right-skewed. The median (6) is far below the mean (11.66). Most
respondents report few poor health days, but a long tail extends to 30
days. This non-normality is a key concern for OLS assumptions.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? Ans : The relationship is very weakly positive with enormous scatter. No strong linear pattern is visually evident. Many high-residual points at phys_days = 30 (ceiling) exist across all BMI levels.
# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_full)
# Summary output
summary(model_slr)
##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE, conf.level = 0.95)
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.42 0.869 8.54 2.04e-17 5.72 9.13
## 2 bmi 0.145 0.0289 5.01 5.66e- 7 0.0884 0.202
# (c) Extract and report: slope, intercept, t-statistic, p-value
coef_table <- tidy(model_slr, conf.int = TRUE)
intercept <- coef_table$estimate[1] # b0 = 7.4228
slope <- coef_table$estimate[2] # b1 = 0.1451
t_slope <- coef_table$statistic[2] # t = 5.01
p_slope <- coef_table$p.value[2] # p = 5.66e-07
Questions:
Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). Ans:Fitted equation: phys_days_hat = 7.4228 + 0.1451 × BMI
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? Ans:Slope = 0.1451: For each 1-unit increase in BMI, expected poor physical health days increases by ~0.145 days, on average. Positive but practically small effect.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? Ans:Intercept (7.42) is NOT interpretable — a BMI of 0 is physiologically impossible. It serves only as a mathematical anchor.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. Ans:H₀: β₁ = 0. t = 5.01, p = 5.66×10⁻⁷ < 0.05 → Reject H₀. Statistically significant association. —
# (a) Display the ANOVA table
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: phys_days ~ bmi",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
y <- brfss_full$phys_days[!is.na(brfss_full$phys_days) & !is.na(brfss_full$bmi)]
y_hat <- fitted(model_slr)
y_bar <- mean(y)
SS_Total <- sum((y - y_bar)^2)
SS_Regression <- sum((y_hat - y_bar)^2)
SS_Residual <- sum((y - y_hat)^2)
cat("SS_Total =", round(SS_Total, 4), "\n")
## SS_Total = 373517.1
cat("SS_Regression =", round(SS_Regression, 4), "\n")
## SS_Regression = 3105.365
cat("SS_Residual =", round(SS_Residual, 4), "\n")
## SS_Residual = 370411.7
# (c) Compute R² two ways: from the model object and from the SS decomposition
R2_from_model <- summary(model_slr)$r.squared
R2_from_SS <- SS_Regression / SS_Total
cat("R² from model:", round(R2_from_model, 6), "\n")
## R² from model: 0.008314
cat("R² from SS: ", round(R2_from_SS, 6), "\n")
## R² from SS: 0.008314
Questions:
Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. Ans:SS_Total = 373,517.11 | SS_Regression = 3,105.36 | SS_Residual = 370,411.74 df_reg = 1 | df_res = 2998 | F = 25.13 R² = 0.0083 (both methods agree)
What is the \(R^2\) value? Interpret it in plain English. Ans: R² = 0.0083 — BMI explains only 0.83% of the variability in phys_days. While statistically significant, BMI is a very poor standalone predictor.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? Ans:The remaining 99.17% of variation is attributable to other factors: chronic illness, age, mental health, SES, healthcare access, etc. —
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_data <- data.frame(bmi = 25)
ci_bmi25 <- predict(model_slr, newdata = new_data,
interval = "confidence", level = 0.95)
print(ci_bmi25)
## fit lwr upr
## 1 11.05108 10.58774 11.51441
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_bmi25 <- predict(model_slr, newdata = new_data,
interval = "prediction", level = 0.95)
print(pi_bmi25)
## fit lwr upr
## 1 11.05108 -10.7485 32.85066
# (c) Plot the regression line with both the CI band and PI band
bmi_seq <- data.frame(bmi = seq(min(brfss_full$bmi, na.rm=TRUE),
max(brfss_full$bmi, na.rm=TRUE), length.out = 300))
ci_band <- predict(model_slr, newdata = bmi_seq, interval = "confidence")
pi_band <- predict(model_slr, newdata = bmi_seq, interval = "prediction")
plot_data <- data.frame(
bmi = bmi_seq$bmi,
fit = ci_band[,"fit"],
ci_lwr = ci_band[,"lwr"],
ci_upr = ci_band[,"upr"],
pi_lwr = pi_band[,"lwr"],
pi_upr = pi_band[,"upr"]
)
ggplot() +
geom_point(data = brfss_full, aes(x = bmi, y = phys_days),
alpha = 0.15, size = 1.2, color = "steelblue") +
geom_ribbon(data = plot_data, aes(x = bmi, ymin = pi_lwr, ymax = pi_upr),
fill = "orange", alpha = 0.20) +
geom_ribbon(data = plot_data, aes(x = bmi, ymin = ci_lwr, ymax = ci_upr),
fill = "red", alpha = 0.35) +
geom_line(data = plot_data, aes(x = bmi, y = fit),
color = "red", size = 1.2) +
geom_vline(xintercept = 25, linetype = "dashed",
color = "darkgreen", alpha = 0.7) +
labs(
title = "SLR: phys_days ~ BMI with 95% CI and PI Bands",
subtitle = "Red band = 95% CI (mean), Orange band = 95% PI (individual)",
x = "BMI", y = "Poor Physical Health Days"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Questions:
For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? Ans: Mean phys_days at BMI=25 = 11.05 days, 95% CI: (10.59, 11.51).
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? Ans: Individual PI at BMI=25: (−10.75, 32.85). The negative lower bound is impossible — reflecting serious model assumption violations.
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? Ans; PI is wider because it includes both uncertainty about the mean (CI) AND natural individual variation. Use CI when estimating average group outcomes; use PI when predicting a specific individual. —
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr)
par(mfrow = c(1, 1))
# (b) Create a residuals vs. fitted plot using ggplot
diag_data <- augment(model_slr) # broom::augment
ggplot(diag_data, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.2, size = 1.5, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE,
color = "darkblue", linewidth = 0.8) +
labs(
title = "Residuals vs Fitted Values",
x = "Fitted Values", y = "Residuals"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(diag_data, aes(sample = .resid)) +
stat_qq(alpha = 0.2, color = "steelblue") +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
x = "Theoretical Quantiles", y = "Sample Quantiles"
) +
theme_minimal()
# (d) Create a Cook's distance plot
ggplot(diag_data, aes(x = seq_along(.cooksd), y = .cooksd)) +
geom_col(
fill = ifelse(diag_data$.cooksd > 4/nrow(diag_data), "red", "steelblue"),
alpha = 0.7) +
geom_hline(yintercept = 4/nrow(diag_data), color = "red",
linetype = "dashed", linewidth = 1) +
labs(
title = "Cook's Distance",
subtitle = paste0("Threshold = 4/n = ", round(4/nrow(diag_data), 5)),
x = "Observation Index", y = "Cook's Distance"
) +
theme_minimal()
# Count influential points
n_influential <- sum(diag_data$.cooksd > 4/nrow(diag_data))
cat("Influential observations (Cook's D > 4/n):", n_influential, "\n")
## Influential observations (Cook's D > 4/n): 136
Questions:
Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. Residuals vs Fitted shows a fan/megaphone pattern (heteroscedasticity) and a band at the ceiling (phys_days = 30). Non-constant variance is evident.
Examine the Q-Q plot. Are the residuals
approximately normal? What do departures from normality in this context
suggest about the distribution of phys_days? Ans:Q-Q plot
shows heavy right-tail departures — residuals are NOT normally
distributed. Expected given the skewed, bounded nature of
phys_days.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Ans:136 influential observations exceed 4/n. These are largely people at the max (30 days). Do not delete — instead consider a better model (e.g., Tobit, negative binomial).
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) Ans:L — weakly met. I — met. N — VIOLATED (right skew). E — VIOLATED (heteroscedasticity). phys_days as a bounded count outcome is the root cause of most violations —
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_full)
summary(model_age)
##
## Call:
## lm(formula = phys_days ~ age, data = brfss_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.872 -8.803 -4.733 9.460 23.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.37023 0.66608 6.561 6.27e-11 ***
## age 0.13127 0.01145 11.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared: 0.04202, Adjusted R-squared: 0.0417
## F-statistic: 131.5 on 1 and 2998 DF, p-value: < 2.2e-16
# (b) Display results and compare to the BMI model
tidy(model_age, conf.int = TRUE)
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.37 0.666 6.56 6.27e-11 3.06 5.68
## 2 age 0.131 0.0114 11.5 7.97e-30 0.109 0.154
cat("\n--- BMI Model ---\n")
##
## --- BMI Model ---
print(tidy(model_slr, conf.int = TRUE))
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.42 0.869 8.54 2.04e-17 5.72 9.13
## 2 bmi 0.145 0.0289 5.01 5.66e- 7 0.0884 0.202
cat("\n--- Age Model ---\n")
##
## --- Age Model ---
print(tidy(model_age, conf.int = TRUE))
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.37 0.666 6.56 6.27e-11 3.06 5.68
## 2 age 0.131 0.0114 11.5 7.97e-30 0.109 0.154
# (c) Which predictor has the stronger association? Compare R² values.
R2_bmi <- summary(model_slr)$r.squared
R2_age <- summary(model_age)$r.squared
cat("R² (BMI model):", round(R2_bmi, 6), "\n")
## R² (BMI model): 0.008314
cat("R² (Age model):", round(R2_age, 6), "\n")
## R² (Age model): 0.042021
Questions:
How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Ans:Age has a positive association (slope = +0.131, p = 7.97e-30). Older people report more poor health days. Age is more strongly associated (t = 11.47) than BMI (t = 5.01).
Compare the \(R^2\) values of
the two models. Which predictor explains more variability in
phys_days? Ans:R²(age) = 4.20% vs R²(bmi) = 0.83%. Age is
the stronger predictor, but both explain trivially small proportions of
variance.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? Ans:Both predictors are significant but weak individually. Poor physical health is a complex, multidetermined outcome. Key SLR limitations: violated normality/equal variance, bounded outcome (1–30 days), single predictors ignore confounding. Multiple regression or count-data models would be more appropriate. —
Submit your completed .Rmd file and the RPubs
link to your knitted HTML document.
Your .Rmd must knit without errors. Make sure all code
chunks produce visible output and all questions are answered in complete
sentences below each code chunk.
Due: Before the next class session.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 ggeffects_2.3.2 broom_1.0.12 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 here_1.0.2 haven_2.5.5
## [9] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [13] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [17] ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 htmlwidgets_1.6.4
## [5] insight_1.4.6 lattice_0.22-7 tzdb_0.5.0 vctrs_0.7.1
## [9] tools_4.5.1 generics_0.1.4 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [17] compiler_4.5.1 farver_2.1.2 textshaping_1.0.4 janitor_2.2.1
## [21] snakecase_0.11.1 htmltools_0.5.9 sass_0.4.10 yaml_2.3.12
## [25] lazyeval_0.2.2 pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0
## [29] nlme_3.1-168 tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
## [33] labeling_0.4.3 splines_4.5.1 rprojroot_2.1.1 fastmap_1.2.0
## [37] grid_4.5.1 cli_3.6.5 magrittr_2.0.4 utf8_1.2.6
## [41] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [45] rmarkdown_2.30 httr_1.4.7 otel_0.2.0 hms_1.1.4
## [49] evaluate_1.0.5 viridisLite_0.4.2 mgcv_1.9-3 rlang_1.1.7
## [53] glue_1.8.0 xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0
## [57] jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1