library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
load('NSDUH_2023.Rdata')
# Keep only the four variables used in this analysis.
AlcoholMHlarge <- select(puf2023_102124, ALCYRTOT, ALCBNG30D, IMPGOUTM, IMPYDAYS)
summary(AlcoholMHlarge)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.0 Min. : 0.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 36.0 1st Qu.: 1.00 1st Qu.:99.00 1st Qu.: 1.0
## Median :200.0 Median :91.00 Median :99.00 Median :999.0
## Mean :466.2 Mean :54.14 Mean :97.03 Mean :554.4
## 3rd Qu.:991.0 3rd Qu.:93.00 3rd Qu.:99.00 3rd Qu.:999.0
## Max. :998.0 Max. :98.00 Max. :99.00 Max. :999.0
# Now I need to drop the special codes: keep only 0-365 days for ALCYRTOT and IMPYDAYS, and only 0-30 for ALCBNG30D and IMPGOUTM.
AlcoholMH1 <- AlcoholMHlarge %>%
  filter(
    ALCYRTOT >= 0, ALCYRTOT <= 365,
    IMPYDAYS >= 0, IMPYDAYS <= 365,
    ALCBNG30D >= 0, ALCBNG30D <= 30,
    IMPGOUTM >= 0, IMPGOUTM <= 30
  )
summary(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.00 Min. : 0.000 Min. :1.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.: 3.00
## Median : 48.00 Median : 1.000 Median :1.00 Median : 30.00
## Mean : 83.47 Mean : 2.462 Mean :1.21 Mean : 82.11
## 3rd Qu.:126.00 3rd Qu.: 2.000 3rd Qu.:1.00 3rd Qu.:120.00
## Max. :364.00 Max. :30.000 Max. :2.00 Max. :365.00
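# A minimal sketch of what the range filter does, using a small made-up data frame rather than the NSDUH file. The values 91, 93, 99, 991 and 999 below are only stand-ins for the kind of special codes visible in the first summary() output; their meaning should be checked against the codebook. dplyr's between() is used here as a compact way to write each pair of bounds.

```r
library(dplyr)

# Hypothetical toy data, NOT real NSDUH responses.
toy <- data.frame(
  ALCYRTOT  = c(12, 200, 991, 364, 48),
  ALCBNG30D = c(0, 2, 91, 30, 93),
  IMPGOUTM  = c(1, 99, 99, 2, 1),
  IMPYDAYS  = c(3, 30, 999, 365, 999)
)

# A row is kept only if all four variables fall inside their valid range.
toy_clean <- toy %>%
  filter(
    between(ALCYRTOT, 0, 365),
    between(IMPYDAYS, 0, 365),
    between(ALCBNG30D, 0, 30),
    between(IMPGOUTM, 0, 30)
  )

# Compare row counts before and after to see how much data the filter removes.
nrow(toy)
nrow(toy_clean)
```

# Because a single out-of-range value removes the whole row, the filtered sample can be much smaller than the original. The regression output further down reports 416 residual degrees of freedom with four coefficients, so 420 rows are used in the model.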
AlcoholLM <- lm(IMPYDAYS ~ ALCYRTOT + ALCBNG30D + IMPGOUTM, data = AlcoholMH1)
summary(AlcoholLM)
##
## Call:
## lm(formula = IMPYDAYS ~ ALCYRTOT + ALCBNG30D + IMPGOUTM, data = AlcoholMH1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.37 -79.15 -24.21 19.96 348.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.33029 16.76531 9.683 < 2e-16 ***
## ALCYRTOT -0.03507 0.06806 -0.515 0.6066
## ALCBNG30D 3.07479 1.24995 2.460 0.0143 *
## IMPGOUTM -70.16258 12.55137 -5.590 4.11e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104.4 on 416 degrees of freedom
## Multiple R-squared: 0.08877, Adjusted R-squared: 0.0822
## F-statistic: 13.51 on 3 and 416 DF, p-value: 1.992e-08
# The adjusted R-squared is 0.0822, which means this model explains only about 8% of the variance in IMPYDAYS. The p-value of the overall F-test is very small (1.992e-08), so the model as a whole is statistically significant. IMPGOUTM is significant at the 0.001 level (***) and ALCBNG30D at the 0.05 level (*); ALCYRTOT is not significant (p = 0.6066). I'm worried that IMPGOUTM is too closely related to IMPYDAYS and that the two may be influencing each other too much.
# From my Annotated Bibliography, I know that others have found that binge drinking can have a greater impact on negative behavior than simply having one drink of alcohol every single day. That seems to be showing up here in the p-values of ALCYRTOT (not significant) and ALCBNG30D (significant at the 0.05 level).
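# One way to look into the worry about IMPGOUTM is to check how strongly it correlates with the outcome and how much of the fit depends on it. The sketch below uses simulated stand-in variables (impair, binge, gout are hypothetical names, and the numbers are invented), so it only illustrates the steps, not the NSDUH results.

```r
set.seed(1)
n <- 200

# Simulated stand-ins, NOT NSDUH data.
binge  <- rpois(n, 2)                      # count-like predictor
gout   <- sample(1:2, n, replace = TRUE)   # two-level predictor
impair <- 150 - 60 * gout + 3 * binge + rnorm(n, sd = 40)
sim    <- data.frame(impair, binge, gout)

# Pairwise correlations between the outcome and the predictors.
round(cor(sim), 2)

# Fit the model with and without the suspect predictor.
full    <- lm(impair ~ binge + gout, data = sim)
reduced <- lm(impair ~ binge, data = sim)

# F-test for whether the extra predictor improves the fit.
model_comparison <- anova(reduced, full)
model_comparison

# How much of the adjusted R-squared comes from that one predictor.
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared
```

# If nearly all of the adjusted R-squared disappears when one predictor is removed, that predictor is carrying the model, which is worth knowing before interpreting the other coefficients.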
# ALCBNG30D is a significant independent variable. Its estimate above is 3.07. Holding the other predictors constant, each additional day in ALCBNG30D (drinking more than 4-5 drinks) is associated with about 3 more days of IMPYDAYS (missed work). That seems very meaningful in a real-world application. For example, one more night of 4 or more drinks goes along with roughly three more missed days of work. This is an association in the data, not proof that the drinking causes the missed days.
# I wonder if this is misrepresented, since IMPYDAYS is the number of days over the last 12 months and ALCBNG30D is the number of days over the last 30 days.
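# The 30-day versus 12-month mismatch can be explored by putting the binge count on a yearly scale. The sketch below uses simulated stand-in data (binge_30d, impair_yr and binge_yr_est are hypothetical names) and assumes, purely for illustration, that the past 30 days are typical of the whole year. Rescaling a predictor changes the size of its slope but not the fit of the model.

```r
set.seed(42)
n <- 300

# Simulated stand-ins, NOT NSDUH data.
binge_30d <- rpois(n, 2)  # binge days in the past 30 days
impair_yr <- pmin(365, pmax(0, round(40 + 3 * binge_30d + rnorm(n, sd = 30))))

# Annualise the 30-day count (assumes the past month is representative).
binge_yr_est <- binge_30d * 365 / 30

fit_30d <- lm(impair_yr ~ binge_30d)
fit_yr  <- lm(impair_yr ~ binge_yr_est)

slope_30d <- unname(coef(fit_30d)["binge_30d"])
slope_yr  <- unname(coef(fit_yr)["binge_yr_est"])

# The two slopes differ by exactly the scaling factor 365 / 30 ...
all.equal(slope_30d, slope_yr * 365 / 30)
# ... while the R-squared is unchanged.
all.equal(summary(fit_30d)$r.squared, summary(fit_yr)$r.squared)

# The same conversion applied to the estimate reported above (about 0.25).
3.07479 * 30 / 365
```

# Read this way, the reported estimate corresponds to roughly a quarter of a day of IMPYDAYS per estimated yearly binge day, which is a less dramatic statement than three days per binge day. The significance test is the same either way.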
plot(AlcoholLM,which=1)
# The red line looks straight at the beginning but curves noticeably over the right third. The points are also plotted in chunks rather than spread evenly around the red line. I would say the relationship in this data set is NOT linear.
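# A visual impression of curvature can be backed up by adding a squared term and testing whether it improves the fit. The sketch below does this on simulated data with curvature deliberately built in (x and y are hypothetical stand-ins, not NSDUH variables), so it shows the technique rather than a result for this model.

```r
set.seed(7)
n <- 300

# Simulated stand-ins with a known curved relationship, NOT NSDUH data.
x   <- runif(n, 0, 365)
y   <- 20 + 0.5 * x - 0.001 * x^2 + rnorm(n, sd = 15)
sim <- data.frame(x, y)

# Straight-line model versus a model that also allows curvature.
linear    <- lm(y ~ x, data = sim)
quadratic <- lm(y ~ x + I(x^2), data = sim)

# Residuals vs fitted: the smoothed line should flatten once the
# squared term absorbs the curvature.
plot(linear, which = 1)
plot(quadratic, which = 1)

# F-test for whether the squared term is needed.
curvature_test <- anova(linear, quadratic)
curvature_test
```

# A small p-value in the comparison supports what the plot suggests, namely that a straight line is not enough. The clumping of points into chunks is a separate issue that a squared term would not fix.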