Multiple Linear Regression

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

# Load the dataset
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ readr     2.1.6
## ✔ ggplot2   4.0.2     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)

## Warning: package 'haven' was built under R version 4.5.2

library(janitor)

## Warning: package 'janitor' was built under R version 4.5.2

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(broom)
library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(GGally)

## Warning: package 'GGally' was built under R version 4.5.2

library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:dplyr':
## 
##     recode

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.5.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.5.2

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.5.2

## corrplot 0.95 loaded

brfss_full <- read_xpt("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss dataset\\LLCP2020.XPT") %>%
clean_names()

# Display dataset dimensions
names(brfss_full)

##   [1] "state"    "fmonth"   "idate"    "imonth"   "iday"     "iyear"   
##   [7] "dispcode" "seqno"    "psu"      "ctelenm1" "pvtresd1" "colghous"
##  [13] "statere1" "celphone" "ladult1"  "colgsex"  "numadult" "landsex" 
##  [19] "nummen"   "numwomen" "respslct" "safetime" "ctelnum1" "cellfon5"
##  [25] "cadult1"  "cellsex"  "pvtresd3" "cclghous" "cstate1"  "landline"
##  [31] "hhadult"  "sexvar"   "genhlth"  "physhlth" "menthlth" "poorhlth"
##  [37] "hlthpln1" "persdoc2" "medcost"  "checkup1" "exerany2" "sleptim1"
##  [43] "cvdinfr4" "cvdcrhd4" "cvdstrk3" "asthma3"  "asthnow"  "chcscncr"
##  [49] "chcocncr" "chccopd2" "havarth4" "addepev3" "chckdny2" "diabete4"
##  [55] "diabage3" "lastden4" "rmvteth4" "marital"  "educa"    "renthom1"
##  [61] "numhhol3" "numphon3" "cpdemo1b" "veteran3" "employ1"  "children"
##  [67] "income2"  "pregnant" "weight2"  "height3"  "deaf"     "blind"   
##  [73] "decide"   "diffwalk" "diffdres" "diffalon" "smoke100" "smokday2"
##  [79] "stopsmk2" "lastsmk2" "usenow3"  "alcday5"  "avedrnk3" "drnk3ge5"
##  [85] "maxdrnks" "flushot7" "flshtmy3" "shingle2" "pneuvac4" "fall12mn"
##  [91] "fallinj4" "seatbelt" "drnkdri2" "hadmam"   "howlong"  "hadpap2" 
##  [97] "lastpap2" "hpvtest"  "hplsttst" "hadhyst2" "pcpsaad3" "pcpsadi1"
## [103] "pcpsare1" "psatest1" "psatime"  "pcpsars1" "colnscpy" "colntest"
## [109] "sigmscpy" "sigmtest" "bldstol1" "lstblds4" "stooldna" "sdnatest"
## [115] "vircolon" "vclntest" "hivtst7"  "hivtstd3" "hivrisk5" "pdiabtst"
## [121] "prediab1" "insulin1" "bldsugar" "feetchk3" "doctdiab" "chkhemo3"
## [127] "feetchk"  "eyeexam1" "diabeye"  "diabedu"  "toldcfs"  "havecfs" 
## [133] "workcfs"  "toldhepc" "trethepc" "prirhepc" "havehepc" "havehepb"
## [139] "medshepb" "hlthcvr1" "cimemlos" "cdhouse"  "cdassist" "cdhelp"  
## [145] "cdsocial" "cddiscus" "caregiv1" "crgvrel4" "crgvlng1" "crgvhrs1"
## [151] "crgvprb3" "crgvalzd" "crgvper1" "crgvhou1" "crgvexpt" "ecigaret"
## [157] "ecignow"  "marijan1" "usemrjn2" "rsnmrjn1" "lcsfirst" "lcslast" 
## [163] "lcsnumcg" "lcsctscn" "cncrdiff" "cncrage"  "cncrtyp1" "csrvtrt3"
## [169] "csrvdoc1" "csrvsum"  "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein"
## [175] "csrvclin" "csrvpain" "csrvctl2" "pcpsade1" "pcdmdec1" "hpvadvc4"
## [181] "hpvadsht" "tetanus1" "imfvpla1" "birthsex" "somale"   "sofemale"
## [187] "trnsgndr" "acedeprs" "acedrink" "acedrugs" "aceprisn" "acedivrc"
## [193] "acepunch" "acehurt1" "aceswear" "acetouch" "acetthem" "acehvsex"
## [199] "rcsgendr" "rcsrltn2" "casthdx2" "casthno2" "qstver"   "qstlang" 
## [205] "metstat"  "urbstat"  "mscode"   "ststr"    "strwt"    "rawrake" 
## [211] "wt2rake"  "imprace"  "chispnc"  "crace1"   "cprace"   "cllcpwt" 
## [217] "dualuse"  "dualcor"  "llcpwt2"  "llcpwt"   "rfhlth"   "phys14d" 
## [223] "ment14d"  "hcvu651"  "totinda"  "michd"    "ltasth1"  "casthm1" 
## [229] "asthms1"  "drdxar2"  "exteth3"  "alteth3"  "denvst3"  "prace1"  
## [235] "mrace1"   "hispanc"  "race"     "raceg21"  "racegr3"  "raceprv" 
## [241] "sex"      "ageg5yr"  "age65yr"  "age80"    "age_g"    "htin4"   
## [247] "htm4"     "wtkg3"    "bmi5"     "bmi5cat"  "rfbmi5"   "chldcnt" 
## [253] "educag"   "incomg"   "smoker3"  "rfsmok3"  "drnkany5" "drocdy3" 
## [259] "rfbing5"  "drnkwk1"  "rfdrhv7"  "flshot7"  "pneumo3"  "rfseat2" 
## [265] "rfseat3"  "drnkdrv"  "rfmam22"  "mam5023"  "rfpap35"  "rfpsa23" 
## [271] "clnscpy"  "sgmscpy"  "sgms10y"  "rfblds4"  "stoldna"  "vircoln" 
## [277] "sbontim"  "crcrec1"  "aidtst4"

#  Recode and clean 
brfss_mlr <- brfss_full%>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss dataset\\brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

library(GGally)

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  ggpairs()

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

model_A <- lm(menthlth_days ~ sleep_hrs, data =brfss_mlr )

summary(model_A)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

Ans: Each additional hour of sleep per night is associated with about 0.8 fewer days of poor mental health per month, on average.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

Null Hypothesis 𝐻0 : 𝛽sleep = 0 There is no relationship between sleep duration and poor mental health days.

Alternative Hypothesis is

HA :𝛽sleep≠0 is Sleep duration is associated with poor mental health days.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

model_B <- lm(menthlth_days ~ sleep_hrs + age + sex,
              data = brfss_mlr)

model_C <- lm(menthlth_days ~ sleep_hrs + age + sex +
               physhlth_days + income_cat + exercise,
              data = brfss_mlr)

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

library(broom)

bind_rows(
  tidy(model_A, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "A"),
  tidy(model_B, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "B"),
  tidy(model_C, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "C")
)

## # A tibble: 3 × 8
##   term      estimate std.error statistic  p.value conf.low conf.high Model
##   <chr>        <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>
## 1 sleep_hrs   -0.804    0.0802    -10.0  2.03e-23   -0.962    -0.647 A    
## 2 sleep_hrs   -0.734    0.0793     -9.25 3.17e-20   -0.889    -0.578 B    
## 3 sleep_hrs   -0.509    0.0753     -6.76 1.57e-11   -0.657    -0.361 C

AnS : The coefficient for sleep hours decreases as additional covariates are added to the model. In Model A, each additional hour of sleep is associated with 0.80 fewer poor mental health days per month. After adjusting for age and sex in Model B, the estimate becomes slightly smaller (−0.73). In Model C, after further adjusting for physical health days, income, and exercise, the association decreases further to −0.51. This change suggests that some of the relationship between sleep and mental health is explained by other factors, particularly physical health and socioeconomic characteristics. However, the association remains strong and statistically meaningful across all models, indicating that sleep duration is independently related to mental health days even after accounting for these potential confounders.

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

Ans : Intercept (β₀)- The intercept represents the predicted number of poor mental health days for a person in the reference categories (baseline sex, income group, and exercise status) when sleep hours, age, and physical health days are zero. While this value helps define the regression line mathematically, it is not typically meaningful in a real-world context.

Sleep Hours (β₁ = -0.509) Each additional hour of sleep per night is associated with about 0.51 fewer poor mental health days per month, on average, after accounting for age, sex, physical health, income, and exercise.

Age (β₂) Holding other factors constant, a one-year increase in age is associated with a β₂ change in the number of poor mental health days per month.

Sex (β₃) Compared with the reference sex group (usually males), individuals in the other sex category report β₃ more or fewer poor mental health days per month, on average, after accounting for other factors.

Physical Health Days (β₄) Each additional day of poor physical health is associated with β₄ additional poor mental health days, suggesting that physical and mental health problems often occur together.

Income Category (β₅) Compared with the reference income group, individuals in higher (or lower) income categories report β₅ fewer or more poor mental health days per month, after adjusting for other characteristics.

Exercise (β₆) People who report exercising have β₆ fewer (or more) poor mental health days per month compared with those who do not exercise, holding other factors constant.

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

library(broom)

glance(model_A)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0197        0.0195  7.64      100. 2.03e-23     1 -17262. 34529. 34549.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

glance(model_B)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0504        0.0498  7.52      88.3 1.10e-55     3 -17182. 34374. 34407.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

glance(model_C)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.157         0.156  7.09      155. 6.48e-181     6 -16885. 33785. 33837.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Ans: Model C explains the most variance in mental health days

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

Ans: Root MSE = 7.09. On average, the predicted number of poor mental health days from Model C differs from the actual reported number of poor mental health days by about 7 days per month.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

anova(model_C)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Source	df	SS	MS	F
Model	6	46,719	7,786.5	154.8
Residual	4,993	250,993	50.3
Total	4,999	297,712

State the null hypothesis for the overall F-test and your conclusion. Ans: Null Hypothesis (H₀) :All regression coefficients are equal to zero. Alternative Hypothesis (H₁) : At least one predictor is associated with mental health days.

The overall F-test is highly statistically significant (p < 0.001). Therefore, we reject the null hypothesis and conclude that the predictors in Model C collectively explain variation in the number of poor mental health days. At least one of the variables in the model is associated with mental health days.

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow=c(2,2))
plot(model_C)

Ans: Residual vs fitted suggests possible non linearity and heteroscedasticity.

Q-Q plot shows clear deviation from normality.

Scale-location plot indicates non-constant variance.

Residual vs leverage shows no severe influential observations.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

Ans: the outcome is bounded (0–30), discrete, and right-skewed, violations of normality and equal variance are expected. These issues do not necessarily invalidate the analysis, especially with a large sample size, but suggest that alternative models for count data (e.g., Poisson or negative binomial regression) could be more appropriate.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

cooks <- cooks.distance(model_C)

sum(cooks > 1)

## [1] 0

Ans: 0 observation.No observations had Cook’s Distance greater than 1.

Interpretation:In a real analysis, influential points should be investigated for data errors or unusual cases before deciding whether to exclude them.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

Ans : Adults who reported fewer hours of sleep tended to experience more days of poor mental health during the past month. Individuals with more physical health problems also reported substantially more poor mental health days. Higher income and regular exercise were associated with fewer days of poor mental health. Because the data were collected at a single point in time, these findings show relationships between these factors but do not establish cause-and-effect.

End of Lab Activity