library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
load('NSDUH_2023.Rdata')
I need to make this into a smaller dataset for ease of analysis.
select <- dplyr::select
AlcoholMHlarge<-select(puf2023_102124, ALCYRTOT,ALCBNG30D, IMPGOUTM, IMPYDAYS)
summary(AlcoholMHlarge)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.0 Min. : 0.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 36.0 1st Qu.: 1.00 1st Qu.:99.00 1st Qu.: 1.0
## Median :200.0 Median :91.00 Median :99.00 Median :999.0
## Mean :466.2 Mean :54.14 Mean :97.03 Mean :554.4
## 3rd Qu.:991.0 3rd Qu.:93.00 3rd Qu.:99.00 3rd Qu.:999.0
## Max. :998.0 Max. :98.00 Max. :99.00 Max. :999.0
Next, I need to remove the values of the variables that exceed 30 days or 365 days.
AlcoholMH1 <- AlcoholMHlarge %>%
filter(ALCYRTOT >= 0 & ALCYRTOT <= 366 &
IMPYDAYS >= 0 & IMPYDAYS <= 366 &
ALCBNG30D >= 0 & ALCBNG30D <= 31 &
IMPGOUTM >= 0 & IMPGOUTM <= 31)
summary(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## Min. : 1.00 Min. : 0.000 Min. :1.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.: 3.00
## Median : 48.00 Median : 1.000 Median :1.00 Median : 30.00
## Mean : 83.47 Mean : 2.462 Mean :1.21 Mean : 82.11
## 3rd Qu.:126.00 3rd Qu.: 2.000 3rd Qu.:1.00 3rd Qu.:120.00
## Max. :364.00 Max. :30.000 Max. :2.00 Max. :365.00
Alcoholmodel_1<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D,data=AlcoholMH1)
I used the #of days of missed work as the dependent variable and the # of days drinking and the # of days binge drinking in the linear model. I called it Alcoholmodel_1.
plot(Alcoholmodel_1,which=1)
raintest(Alcoholmodel_1)
##
## Rainbow test
##
## data: Alcoholmodel_1
## Rain = 0.98924, df1 = 210, df2 = 207, p-value = 0.5312
Above is the plot of this lm. The fitted line is not straight and therefore the data is not linear.
Next, we run the Rainbow test which has a p-value of 0.5312 which is not low, and therefore not statistically significant. This means that the # of total days drinking (ALCYRTOT) and the # of days binging (ALCBNG30D) do not have statistically significant effect on # of days missed work (IMPYDAYS)
But let’s test for the other assumptions.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(Alcoholmodel_1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02516355 2.047404 0.63
## Alternative hypothesis: rho != 0
Next we look at the Durbin Watson test, which brings about the p-value of 0.63 which is high and means the errors are independent and there is no correlation between the variables. Also, the DW is basically 2, which also means no correlation detected; this is good news.
plot(Alcoholmodel_1,which=3)
bptest(Alcoholmodel_1)
##
## studentized Breusch-Pagan test
##
## data: Alcoholmodel_1
## BP = 2.0513, df = 2, p-value = 0.3586
This plot is showing that the line is pretty straight, although flowing slightly at an angle. The points don’t seem evenly distributed around the line to me. This would demonstrate a violation of Homoscedasticity.
The second command for the Breusch-Pagan test shows a result of p-value of 0.35 which means the null hypothesis cannot be rejected and the model is homoscedasticy can be assumed, which conflicts with the above, so I’m confused.
plot(Alcoholmodel_1,which=2)
There is significant deviation from the dotted line, which means
residuals are not evenly distributed.
shapiro.test(Alcoholmodel_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: Alcoholmodel_1$residuals
## W = 0.79402, p-value < 2.2e-16
The p-value is WAY WAY below 0.05, which means that the residuals are not normal.
Alcoholmodel_2<-lm(IMPYDAYS~ALCYRTOT+ALCBNG30D+IMPGOUTM,data=AlcoholMH1)
vif(Alcoholmodel_2)
## ALCYRTOT ALCBNG30D IMPGOUTM
## 1.395309 1.401562 1.005543
This is interersting that all three variables are right about 1.0 which means they are NOT correlated to another variable, meaning this assumption is NOT violated.
cor(AlcoholMH1)
## ALCYRTOT ALCBNG30D IMPGOUTM IMPYDAYS
## ALCYRTOT 1.000000000 0.53162182 -0.009641016 0.04650247
## ALCBNG30D 0.531621816 1.00000000 -0.067479458 0.13885904
## IMPGOUTM -0.009641016 -0.06747946 1.000000000 -0.27127344
## IMPYDAYS 0.046502472 0.13885904 -0.271273440 1.00000000
The Correlation is showing that none of these variables have a super high correlation, but the closest are # of days drinking (ALCYRTOT) and # of days binge drinking (ALCBNG30D) but that’s only at 0.53.
If your model violates an assumption, which one? This model is violating linearity, homoscedasticity (maybe) and normality of residuals.
What would you do to mitigate this assumption? Show your work. I will try a transformation in order to mitigate linearity, Homoscedasticity and normality of residuals. I can’t use log because there will be zeroes since legitimate responses may be zero days for one of the four variables. I can try using square root instead.
Alcohol_Sqrt<-lm(sqrt(IMPYDAYS)~sqrt(ALCBNG30D),data=AlcoholMH1)
plot(Alcohol_Sqrt,which=1)
I suppose the fitted line is a bit more straigt here and the residuals
are certainly more evenly spread around the fitted line.
Now I can check for any improvement in homoscedasticity and normality of residuals
plot(Alcohol_Sqrt,which=3)
bptest(Alcohol_Sqrt)
##
## studentized Breusch-Pagan test
##
## data: Alcohol_Sqrt
## BP = 1.166, df = 1, p-value = 0.2802
I was on the fence anyway on the homoscedasticity and I’m still not sure if this plot demonstrates enough of a transformation to meet this assumption. The p value of the BP test is significant on its own though.
Next, let’s look at normality of residuals.
plot(Alcohol_Sqrt,which=2)
This looks WAY better with many points hitting the line or near the line. Plus, the start and end of the points trail off, like we would expect.
With this Square root transformation we can meet more of the assumptions.