1. Load your chosen dataset into Rmarkdown
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
load('NSDUH_2023.Rdata')
  1. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable

#This dataset has 56,705 observations across 2,000 variables, so I made a smaller dataset with just four variables, called “AlcoholMHlarge”. The variables are the number of days in the last year drinking alcohol (ALCYRTOT), the number of days in the last 30 days had four/five drinks (ALCBNG30D), the number of days could not leave the house (IMPGOUTM), and the number of days in the last year could not go to work (IMPYDAYS). IMPYDAYS is the dependent variable.
AlcoholMHlarge <- select(puf2023_102124, ALCYRTOT, ALCBNG30D, IMPGOUTM, IMPYDAYS)

summary(AlcoholMHlarge)
##     ALCYRTOT       ALCBNG30D        IMPGOUTM        IMPYDAYS    
##  Min.   :  1.0   Min.   : 0.00   Min.   : 1.00   Min.   :  0.0  
##  1st Qu.: 36.0   1st Qu.: 1.00   1st Qu.:99.00   1st Qu.:  1.0  
##  Median :200.0   Median :91.00   Median :99.00   Median :999.0  
##  Mean   :466.2   Mean   :54.14   Mean   :97.03   Mean   :554.4  
##  3rd Qu.:991.0   3rd Qu.:93.00   3rd Qu.:99.00   3rd Qu.:999.0  
##  Max.   :998.0   Max.   :98.00   Max.   :99.00   Max.   :999.0

#Now I need to remove any rows with values outside 0-365 days for ALCYRTOT and IMPYDAYS, and outside 0-30 days for ALCBNG30D and IMPGOUTM.

AlcoholMH1 <- AlcoholMHlarge %>% 
  filter(ALCYRTOT >= 0 & ALCYRTOT <= 365 &
         IMPYDAYS >= 0 & IMPYDAYS <= 365 &
         ALCBNG30D >= 0 & ALCBNG30D <= 30 &
         IMPGOUTM >= 0 & IMPGOUTM <= 30)

summary(AlcoholMH1)
##     ALCYRTOT        ALCBNG30D         IMPGOUTM       IMPYDAYS     
##  Min.   :  1.00   Min.   : 0.000   Min.   :1.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:  3.00  
##  Median : 48.00   Median : 1.000   Median :1.00   Median : 30.00  
##  Mean   : 83.47   Mean   : 2.462   Mean   :1.21   Mean   : 82.11  
##  3rd Qu.:126.00   3rd Qu.: 2.000   3rd Qu.:1.00   3rd Qu.:120.00  
##  Max.   :364.00   Max.   :30.000   Max.   :2.00   Max.   :365.00
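#A minimal sketch for checking how much data each filter removes. This matters because the model further down is fit on only 420 rows (416 residual degrees of freedom plus 4 coefficients) out of 56,705. The data frame `toy` and its values are made up for illustration and stand in for AlcoholMHlarge; the upper limits are assumed from the ranges described above. It uses base R only, so it runs without the NSDUH file.

```r
# Toy stand-in for AlcoholMHlarge; every value here is invented for illustration.
toy <- data.frame(
  ALCYRTOT  = c(12, 200, 991, 364, 48, 993),
  ALCBNG30D = c(0, 3, 91, 30, 93, 1),
  IMPGOUTM  = c(1, 99, 2, 1, 99, 2),
  IMPYDAYS  = c(5, 999, 30, 365, 999, 0)
)

# Assumed upper bound of the valid range for each variable (lower bound is 0).
upper <- c(ALCYRTOT = 365, ALCBNG30D = 30, IMPGOUTM = 30, IMPYDAYS = 365)

# Logical matrix: TRUE where a value falls inside 0..upper for its column.
in_range <- sapply(names(upper), function(v) toy[[v]] >= 0 & toy[[v]] <= upper[[v]])

# Rows that fail each filter on its own.
dropped_per_var <- colSums(!in_range)

# Rows that pass all four filters at once (what the combined filter keeps).
kept <- toy[rowSums(in_range) == ncol(in_range), ]

dropped_per_var
nrow(kept)
```

#Running the same idea on AlcoholMHlarge would show which variable is responsible for most of the dropped rows.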
  1. Create a linear model using the “lm()” command and save it to an object
AlcoholLM <- lm(IMPYDAYS ~ ALCYRTOT + ALCBNG30D + IMPGOUTM, data = AlcoholMH1)
  1. Call “summary()” on your new model
summary(AlcoholLM)
## 
## Call:
## lm(formula = IMPYDAYS ~ ALCYRTOT + ALCBNG30D + IMPGOUTM, data = AlcoholMH1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146.37  -79.15  -24.21   19.96  348.04 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.33029   16.76531   9.683  < 2e-16 ***
## ALCYRTOT     -0.03507    0.06806  -0.515   0.6066    
## ALCBNG30D     3.07479    1.24995   2.460   0.0143 *  
## IMPGOUTM    -70.16258   12.55137  -5.590 4.11e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.4 on 416 degrees of freedom
## Multiple R-squared:  0.08877,    Adjusted R-squared:  0.0822 
## F-statistic: 13.51 on 3 and 416 DF,  p-value: 1.992e-08
  1. Interpret the model’s R-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

#The Multiple R-squared is 0.0888 and the Adjusted R-squared is 0.0822, which means this model explains only about 8% of the variance in IMPYDAYS. The overall p-value is tiny (1.992e-08), so the model as a whole is statistically significant. IMPGOUTM is highly significant (p = 4.11e-08) and ALCBNG30D is significant at the 0.05 level (p = 0.0143). ALCYRTOT is not significant (p = 0.6066). I’m worried that IMPGOUTM is too closely related to IMPYDAYS and they may be impacting each other too much.

#From my Annotated Bibliography, I know that others have found that binge drinking can have more impact on negative behavior than simply having one drink of alcohol every single day. That seems to be showing up here in the p-values of ALCYRTOT (insignificant) and ALCBNG30D (significant at the 0.05 level).
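#As a check on how these numbers fit together, here is a small sketch that recomputes the Adjusted R-squared and the overall p-value from the values printed in the summary output above. Nothing is re-estimated; the variable names (`r2`, `f_stat`, and so on) are just labels I chose for the copied numbers.

```r
# Values copied from the summary(AlcoholLM) output above.
r2     <- 0.08877   # Multiple R-squared
f_stat <- 13.51     # F-statistic
df_num <- 3         # number of predictors
df_res <- 416       # residual degrees of freedom

# Observations used in the fit: residual df + predictors + intercept.
n <- df_res + df_num + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Overall p-value: upper tail of the F distribution at the observed F-statistic.
p_overall <- pf(f_stat, df_num, df_res, lower.tail = FALSE)

round(adj_r2, 4)
signif(p_overall, 3)
```

#With the fitted object available, the same quantities can be read directly from `summary(AlcoholLM)$r.squared` and `summary(AlcoholLM)$adj.r.squared`.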

  1. Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?

#ALCBNG30D is a significant independent variable. Its Estimate above is 3.075. This means that, holding the other variables constant, each additional day of ALCBNG30D (having four/five drinks) is associated with IMPYDAYS (missed work) being about 3 days higher. That seems very significant in a real-world application. For example, if I have 4 drinks on one more night, the model predicts about three more missed days of work.

#I wonder if this is misrepresented, since IMPYDAYS is the number of days over the last 12 months and ALCBNG30D is over the last 30 days.
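#To see how precise that estimate is, here is a rough sketch of a 95% confidence interval for the ALCBNG30D coefficient, built only from the Estimate, Std. Error, and residual degrees of freedom printed in the model summary. The names `est`, `se`, and `ci` are my own labels, not part of the model output.

```r
# Estimate and standard error for ALCBNG30D, copied from the coefficient table.
est    <- 3.07479
se     <- 1.24995
df_res <- 416   # residual degrees of freedom from the same output

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_res)
ci     <- c(lower = est - t_crit * se, upper = est + t_crit * se)

# The t value reported in the table is the estimate divided by its standard error.
t_value <- est / se

round(ci, 2)
round(t_value, 3)
```

#With the fitted object available, `confint(AlcoholLM, "ALCBNG30D")` should give the same interval. A wide interval would mean the "3 days" figure is only a rough guide.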

  1. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
plot(AlcoholLM,which=1)

#The red line looks straight at the beginning, but then curves noticeably in the right third. Also, the points are plotted in chunks and are not evenly spread around the red line. I would say this model violates the assumption of linearity.
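#A small sketch of what the red line is measuring, using invented data rather than the NSDUH variables. As far as I understand it, the red line in `plot(x, which = 1)` is a lowess smooth of the residuals against the fitted values, so a line that stays near zero suggests linearity and a bent line suggests a missed curve. The helper `max_bend()` is hypothetical and only summarises how far that smooth strays from zero.

```r
set.seed(1)  # reproducible toy data

# Invented data with a deliberately curved relationship between x and y.
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(200, sd = 1)

fit_line <- lm(y ~ x)            # straight-line model
fit_quad <- lm(y ~ x + I(x^2))   # same model with a squared term added

# Largest distance of the lowess-smoothed residuals from zero.
max_bend <- function(fit) {
  sm <- lowess(fitted(fit), resid(fit))
  max(abs(sm$y))
}

bend_line <- max_bend(fit_line)
bend_quad <- max_bend(fit_quad)

# plot(fit_line, which = 1) and plot(fit_quad, which = 1) show the same contrast visually.
c(line = bend_line, quad = bend_quad)
```

#In this toy case the straight-line model leaves a strongly bent smooth, while adding the squared term flattens it. Whether a transformation or extra term would help AlcoholLM in the same way is something I would have to test on the real data.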