R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. # Setup and Data

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)

## Warning: package 'haven' was built under R version 4.5.2

library(here)

## Warning: package 'here' was built under R version 4.5.2

## here() starts at C:/Users/safwa/OneDrive - University at Albany - SUNY/EPI 553/Labs

library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(broom)
library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

Loading BRFSS 2020 Data

# Load raw BRFSS 2020 data
brfss_full <- read_rds("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_slr_2020.rds") %>%
janitor::clean_names()

# Display dataset dimensions
names(brfss_full)

## [1] "bmi"            "age"            "sex"            "education"     
## [5] "gen_health_num" "sleep_hrs"      "phys_days"

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
summary(brfss_full[, c("phys_days", "bmi")])

##    phys_days          bmi       
##  Min.   : 1.00   Min.   :14.63  
##  1st Qu.: 2.00   1st Qu.:24.32  
##  Median : 6.00   Median :27.89  
##  Mean   :11.66   Mean   :29.18  
##  3rd Qu.:20.00   3rd Qu.:32.89  
##  Max.   :30.00   Max.   :59.60

# (b) Create a histogram of phys_days — describe the distribution

ggplot(brfss_full, aes(x = phys_days)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Histogram of Physical Days", x = "Physical Days", y = "Frequency")

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)

ggplot(brfss_full, aes(x = bmi, y = phys_days)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  labs(title = "Physical Days vs BMI", x = "BMI", y = "Physical Days")

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? Ans: Mean phys_days = 11.66, SD = 11.16. Mean bmi = 29.18, SD = 7.01.phys_days is strongly right-skewed. The median (6) is far below the mean (11.66). Most respondents report few poor health days, but a long tail extends to 30 days. This non-normality is a key concern for OLS assumptions.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? Ans : The relationship is very weakly positive with enormous scatter. No strong linear pattern is visually evident. Many high-residual points at phys_days = 30 (ceiling) exist across all BMI levels.

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_full)

# Summary output
summary(model_slr)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.808  -9.160  -5.623   8.943  20.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.42285    0.86881   8.544  < 2e-16 ***
## bmi          0.14513    0.02895   5.013 5.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared:  0.008314,   Adjusted R-squared:  0.007983 
## F-statistic: 25.13 on 1 and 2998 DF,  p-value: 5.659e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE, conf.level = 0.95)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    7.42     0.869       8.54 2.04e-17   5.72       9.13 
## 2 bmi            0.145    0.0289      5.01 5.66e- 7   0.0884     0.202

# (c) Extract and report: slope, intercept, t-statistic, p-value
coef_table <- tidy(model_slr, conf.int = TRUE)
intercept   <- coef_table$estimate[1]   # b0 = 7.4228
slope       <- coef_table$estimate[2]   # b1 = 0.1451
t_slope     <- coef_table$statistic[2]  # t  = 5.01
p_slope     <- coef_table$p.value[2]    # p  = 5.66e-07

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). Ans:Fitted equation: phys_days_hat = 7.4228 + 0.1451 × BMI
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? Ans:Slope = 0.1451: For each 1-unit increase in BMI, expected poor physical health days increases by ~0.145 days, on average. Positive but practically small effect.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? Ans:Intercept (7.42) is NOT interpretable — a BMI of 0 is physiologically impossible. It serves only as a mathematical anchor.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. Ans:H₀: β₁ = 0. t = 5.01, p = 5.66×10⁻⁷ < 0.05 → Reject H₀. Statistically significant association. —

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: phys_days ~ bmi",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: phys_days ~ bmi
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3105.365	3105.365	25.134	0
Residuals	2998	370411.743	123.553	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
y     <- brfss_full$phys_days[!is.na(brfss_full$phys_days) & !is.na(brfss_full$bmi)]
y_hat <- fitted(model_slr)
y_bar <- mean(y)
 
SS_Total      <- sum((y - y_bar)^2)
SS_Regression <- sum((y_hat - y_bar)^2)
SS_Residual   <- sum((y - y_hat)^2)
 
cat("SS_Total      =", round(SS_Total, 4), "\n")

## SS_Total      = 373517.1

cat("SS_Regression =", round(SS_Regression, 4), "\n")

## SS_Regression = 3105.365

cat("SS_Residual   =", round(SS_Residual, 4), "\n")

## SS_Residual   = 370411.7

# (c) Compute R² two ways: from the model object and from the SS decomposition
R2_from_model <- summary(model_slr)$r.squared
R2_from_SS    <- SS_Regression / SS_Total
cat("R² from model:", round(R2_from_model, 6), "\n")

## R² from model: 0.008314

cat("R² from SS:   ", round(R2_from_SS, 6), "\n")

## R² from SS:    0.008314

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. Ans:SS_Total = 373,517.11 | SS_Regression = 3,105.36 | SS_Residual = 370,411.74 df_reg = 1 | df_res = 2998 | F = 25.13 R² = 0.0083 (both methods agree)
What is the \(R^2\) value? Interpret it in plain English. Ans: R² = 0.0083 — BMI explains only 0.83% of the variability in phys_days. While statistically significant, BMI is a very poor standalone predictor.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? Ans:The remaining 99.17% of variation is attributable to other factors: chronic illness, age, mental health, SES, healthcare access, etc. —

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_data <- data.frame(bmi = 25)
 
ci_bmi25 <- predict(model_slr, newdata = new_data,
                    interval = "confidence", level = 0.95)
print(ci_bmi25)

##        fit      lwr      upr
## 1 11.05108 10.58774 11.51441

# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_bmi25 <- predict(model_slr, newdata = new_data,
                    interval = "prediction", level = 0.95)
print(pi_bmi25)

##        fit      lwr      upr
## 1 11.05108 -10.7485 32.85066

# (c) Plot the regression line with both the CI band and PI band
bmi_seq <- data.frame(bmi = seq(min(brfss_full$bmi, na.rm=TRUE),
                                  max(brfss_full$bmi, na.rm=TRUE), length.out = 300))
ci_band <- predict(model_slr, newdata = bmi_seq, interval = "confidence")
pi_band <- predict(model_slr, newdata = bmi_seq, interval = "prediction")
 
plot_data <- data.frame(
  bmi    = bmi_seq$bmi,
  fit    = ci_band[,"fit"],
  ci_lwr = ci_band[,"lwr"],
  ci_upr = ci_band[,"upr"],
  pi_lwr = pi_band[,"lwr"],
  pi_upr = pi_band[,"upr"]
)
 
ggplot() +
  geom_point(data =  brfss_full, aes(x = bmi, y = phys_days),
             alpha = 0.15, size = 1.2, color = "steelblue") +
  geom_ribbon(data = plot_data, aes(x = bmi, ymin = pi_lwr, ymax = pi_upr),
              fill = "orange", alpha = 0.20) +
  geom_ribbon(data = plot_data, aes(x = bmi, ymin = ci_lwr, ymax = ci_upr),
              fill = "red", alpha = 0.35) +
  geom_line(data = plot_data, aes(x = bmi, y = fit),
            color = "red", size = 1.2) +
  geom_vline(xintercept = 25, linetype = "dashed",
             color = "darkgreen", alpha = 0.7) +
  labs(
    title    = "SLR: phys_days ~ BMI with 95% CI and PI Bands",
    subtitle = "Red band = 95% CI (mean), Orange band = 95% PI (individual)",
    x = "BMI", y = "Poor Physical Health Days"
  ) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? Ans: Mean phys_days at BMI=25 = 11.05 days, 95% CI: (10.59, 11.51).
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? Ans: Individual PI at BMI=25: (−10.75, 32.85). The negative lower bound is impossible — reflecting serious model assumption violations.
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? Ans; PI is wider because it includes both uncertainty about the mean (CI) AND natural individual variation. Use CI when estimating average group outcomes; use PI when predicting a specific individual. —

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr)

par(mfrow = c(1, 1))


# (b) Create a residuals vs. fitted plot using ggplot
diag_data <- augment(model_slr)  # broom::augment
 
ggplot(diag_data, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.2, size = 1.5, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE,
              color = "darkblue", linewidth = 0.8) +
  labs(
    title = "Residuals vs Fitted Values",
    x = "Fitted Values", y = "Residuals"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(diag_data, aes(sample = .resid)) +
  stat_qq(alpha = 0.2, color = "steelblue") +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal()

# (d) Create a Cook's distance plot
ggplot(diag_data, aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col(
    fill = ifelse(diag_data$.cooksd > 4/nrow(diag_data), "red", "steelblue"),
    alpha = 0.7) +
  geom_hline(yintercept = 4/nrow(diag_data), color = "red",
             linetype = "dashed", linewidth = 1) +
  labs(
    title    = "Cook's Distance",
    subtitle = paste0("Threshold = 4/n = ", round(4/nrow(diag_data), 5)),
    x = "Observation Index", y = "Cook's Distance"
  ) +
  theme_minimal()

# Count influential points
n_influential <- sum(diag_data$.cooksd > 4/nrow(diag_data))
cat("Influential observations (Cook's D > 4/n):", n_influential, "\n")

## Influential observations (Cook's D > 4/n): 136

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. Residuals vs Fitted shows a fan/megaphone pattern (heteroscedasticity) and a band at the ceiling (phys_days = 30). Non-constant variance is evident.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? Ans:Q-Q plot shows heavy right-tail departures — residuals are NOT normally distributed. Expected given the skewed, bounded nature of phys_days.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Ans:136 influential observations exceed 4/n. These are largely people at the max (30 days). Do not delete — instead consider a better model (e.g., Tobit, negative binomial).
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) Ans:L — weakly met. I — met. N — VIOLATED (right skew). E — VIOLATED (heteroscedasticity). phys_days as a bounded count outcome is the root cause of most violations —

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_full)
summary(model_age)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.872  -8.803  -4.733   9.460  23.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.37023    0.66608   6.561 6.27e-11 ***
## age          0.13127    0.01145  11.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared:  0.04202,    Adjusted R-squared:  0.0417 
## F-statistic: 131.5 on 1 and 2998 DF,  p-value: < 2.2e-16

# (b) Display results and compare to the BMI model

tidy(model_age, conf.int = TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    4.37     0.666       6.56 6.27e-11    3.06      5.68 
## 2 age            0.131    0.0114     11.5  7.97e-30    0.109     0.154

cat("\n--- BMI Model ---\n")

## 
## --- BMI Model ---

print(tidy(model_slr, conf.int = TRUE))

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    7.42     0.869       8.54 2.04e-17   5.72       9.13 
## 2 bmi            0.145    0.0289      5.01 5.66e- 7   0.0884     0.202

cat("\n--- Age Model ---\n")

## 
## --- Age Model ---

print(tidy(model_age, conf.int = TRUE))

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    4.37     0.666       6.56 6.27e-11    3.06      5.68 
## 2 age            0.131    0.0114     11.5  7.97e-30    0.109     0.154

# (c) Which predictor has the stronger association? Compare R² values.
R2_bmi <- summary(model_slr)$r.squared
R2_age <- summary(model_age)$r.squared
cat("R² (BMI model):", round(R2_bmi, 6), "\n")

## R² (BMI model): 0.008314

cat("R² (Age model):", round(R2_age, 6), "\n")

## R² (Age model): 0.042021

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Ans:Age has a positive association (slope = +0.131, p = 7.97e-30). Older people report more poor health days. Age is more strongly associated (t = 11.47) than BMI (t = 5.01).
Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days? Ans:R²(age) = 4.20% vs R²(bmi) = 0.83%. Age is the stronger predictor, but both explain trivially small proportions of variance.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? Ans:Both predictors are significant but weak individually. Poor physical health is a complex, multidetermined outcome. Key SLR limitations: violated normality/equal variance, bounded outcome (1–30 days), single predictors ignore confounding. Multiple regression or count-data models would be more appropriate. —

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.

Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  ggeffects_2.3.2  broom_1.0.12     plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       here_1.0.2       haven_2.5.5     
##  [9] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.1.4     
## [13] purrr_1.2.1      readr_2.1.6      tidyr_1.3.2      tibble_3.3.1    
## [17] ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       htmlwidgets_1.6.4 
##  [5] insight_1.4.6      lattice_0.22-7     tzdb_0.5.0         vctrs_0.7.1       
##  [9] tools_4.5.1        generics_0.1.4     pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1           lifecycle_1.0.5   
## [17] compiler_4.5.1     farver_2.1.2       textshaping_1.0.4  janitor_2.2.1     
## [21] snakecase_0.11.1   htmltools_0.5.9    sass_0.4.10        yaml_2.3.12       
## [25] lazyeval_0.2.2     pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0      
## [29] nlme_3.1-168       tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [33] labeling_0.4.3     splines_4.5.1      rprojroot_2.1.1    fastmap_1.2.0     
## [37] grid_4.5.1         cli_3.6.5          magrittr_2.0.4     utf8_1.2.6        
## [41] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [45] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.4         
## [49] evaluate_1.0.5     viridisLite_0.4.2  mgcv_1.9-3         rlang_1.1.7       
## [53] glue_1.8.0         xml2_1.5.2         svglite_2.2.2      rstudioapi_0.18.0 
## [57] jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1

Simple Linear Regression

Ummat Safwat Sristy

2026-02-26

R Markdown

Including Plots

Loading BRFSS 2020 Data

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info