Homework7

Load your preferred dataset into R studio

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pastecs)

## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

load('NSDUH_2023.Rdata')

I need to make this into a smaller dataset for ease of analysis.

select <- dplyr::select

AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS)

summary(AlcoholMHlarge)

##     ALCYRTOT       ALCBNG30D        IMPGOUTM        IMPYDAYS    
##  Min.   :  1.0   Min.   : 0.00   Min.   : 1.00   Min.   :  0.0  
##  1st Qu.: 36.0   1st Qu.: 1.00   1st Qu.:99.00   1st Qu.:  1.0  
##  Median :200.0   Median :91.00   Median :99.00   Median :999.0  
##  Mean   :466.2   Mean   :54.14   Mean   :97.03   Mean   :554.4  
##  3rd Qu.:991.0   3rd Qu.:93.00   3rd Qu.:99.00   3rd Qu.:999.0  
##  Max.   :998.0   Max.   :98.00   Max.   :99.00   Max.   :999.0

Next, I need to remove the values of the variables that exceed 30 days or 365 days.

AlcoholMH1 <- AlcoholMHlarge %>% 
  filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
         IMPYDAYS >= 0 & IMPYDAYS <= 366 &
         ALCBNG30D >= 0 & ALCBNG30D <= 31 &
         IMPGOUTM >= 0 & IMPGOUTM <= 31)

summary(AlcoholMH1)

##     ALCYRTOT        ALCBNG30D         IMPGOUTM       IMPYDAYS     
##  Min.   :  1.00   Min.   : 0.000   Min.   :1.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:  3.00  
##  Median : 48.00   Median : 1.000   Median :1.00   Median : 30.00  
##  Mean   : 83.47   Mean   : 2.462   Mean   :1.21   Mean   : 82.11  
##  3rd Qu.:126.00   3rd Qu.: 2.000   3rd Qu.:1.00   3rd Qu.:120.00  
##  Max.   :364.00   Max.   :30.000   Max.   :2.00   Max.   :365.00

Create a linear model “lm()” from the variables, with a continuous dependent variable as the outcome

Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=AlcoholMH1)

I used the #of days of missed work as the dependent variable and the # of days drinking and the # of days binge drinking in the linear model. I called it Alcoholmodel_1.

Check the following assumptions:

Linearity (plot and raintest)

plot(Alcoholmodel_1,which=1)

raintest(Alcoholmodel_1)

## 
##  Rainbow test
## 
## data:  Alcoholmodel_1
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312

Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.

Next, we run the Rainbow test which has a p-value of 0.5312 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) do not have statistically significant effect on # of days missed work (IMPYDAYS)

But let’s test for the other assumptions.

Independence of errors (durbin-watson)

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

durbinWatsonTest(Alcoholmodel_1)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02516355      2.047404    0.63
##  Alternative hypothesis: rho != 0

Next we look at the Durbin Watson test, which brings about the p-value of 0.63 which is high and means the errors are independent and there is no correlation between the variables. Also, the DW is basically 2, which also means no correlation detected; this is good news.

Homoscedasticity (plot, bptest)

plot(Alcoholmodel_1,which=3)

bptest(Alcoholmodel_1)

## 
##  studentized Breusch-Pagan test
## 
## data:  Alcoholmodel_1
## BP = 2.0513, df = 2, p-value = 0.3586

This plot is showing that the line is pretty straight, although flowing slightly at an angle. The points don’t seem evenly distributed around the line to me. This would demonstrate a violation of Homoscedasticity.

The second command for the Breusch-Pagan test shows a result of p-value of 0.35 which means the null hypothesis cannot be rejected and the model is homoscedasticy can be assumed, which conflicts with the above, so I’m confused.

Normality of residuals (QQ plot, shapiro test)

plot(Alcoholmodel_1,which=2)

There is significant deviation from the dotted line, which means residuals are not evenly distributed.

shapiro.test(Alcoholmodel_1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  Alcoholmodel_1$residuals
## W = 0.79402, p-value < 2.2e-16

The p-value is WAY WAY below 0.05, which means that the residuals are not normal.

No multicolinarity (VIF, cor)

Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+IMPGOUTM,data=AlcoholMH1)

vif(Alcoholmodel_2)

##  ALCYRTOT ALCBNG30D  IMPGOUTM 
##  1.395309  1.401562  1.005543

This is interersting that all three variables are right about 1.0 which means they are NOT correlated to another variable, meaning this assumption is NOT violated.

cor(AlcoholMH1)

##               ALCYRTOT   ALCBNG30D     IMPGOUTM    IMPYDAYS
## ALCYRTOT   1.000000000  0.53162182 -0.009641016  0.04650247
## ALCBNG30D  0.531621816  1.00000000 -0.067479458  0.13885904
## IMPGOUTM  -0.009641016 -0.06747946  1.000000000 -0.27127344
## IMPYDAYS   0.046502472  0.13885904 -0.271273440  1.00000000

The Correlation is showing that none of these variables have a super high correlation, but the closest are # of days drinking (ALCYRTOT) and # of days binge drinking (ALCBNG30D) but that’s only at 0.53.

does your model meet those assumptions? You don’t have to be perfectly right, just make a good case. From what I see, the result is the following:

Linearity (plot and raintest) - DOES NOT MEET
Independence of errors (durbin-watson) - MEETS THIS ASSUMPTION
Homoscedasticity (plot, bptest) - I’m not sure on this…the graph didn’t show this, but the bptest had a p-value of 0.35
Normality of residuals (QQ plot, shapiro test) - DOES NOT MEET THIS ASSUMPTION
No multicolinarity (VIF, cor) - MEETS THIS ASSUMPTION and the variables are not correlated to each other.

If your model violates an assumption, which one? This model is violating linearity, homoscedasticity (maybe) and normality of residuals.
What would you do to mitigate this assumption? Show your work. I will try a transformation in order to mitigate linearity, Homoscedasticity and normality of residuals. I can’t use log because there will be zeroes since legitimate responses may be zero days for one of the four variables. I can try using square root instead.

Alcohol_Sqrt<-lm(sqrt(IMPYDAYS)~sqrt(ALCBNG30D),data=AlcoholMH1)

plot(Alcohol_Sqrt,which=1)

I suppose the fitted line is a bit more straigt here and the residuals are certainly more evenly spread around the fitted line.

Now I can check for any improvement in homoscedasticity and normality of residuals

plot(Alcohol_Sqrt,which=3)

bptest(Alcohol_Sqrt)

## 
##  studentized Breusch-Pagan test
## 
## data:  Alcohol_Sqrt
## BP = 1.166, df = 1, p-value = 0.2802

I was on the fence anyway on the homoscedasticity and I’m still not sure if this plot demonstrates enough of a transformation to meet this assumption. The p value of the BP test is significant on its own though.

Next, let’s look at normality of residuals.

plot(Alcohol_Sqrt,which=2)

This looks WAY better with many points hitting the line or near the line. Plus, the start and end of the points trail off, like we would expect.

With this Square root transformation we can meet more of the assumptions.

Homework7

Megan Janzen

2025-04-05