PAD 6833 HW7

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

duh_data <- load("NSDUH 2022.Rdata")
nsduh <- puf2022_110424

model1<-lm(DSTNRV30 ~ BNGDRMDAYS + MJDAY30A + CIG30USE + COCUS30A + METHAM30N + PNRNM30FQ, data = nsduh)

plot(model1,which=1)

raintest(model1)

## 
##  Rainbow test
## 
## data:  model1
## Rain = 1.0151, df1 = 29535, df2 = 29527, p-value = 0.09906

because the pvalue is greater than .05, there is no evidence of nonlinearity and the model meets the linearity assumption.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

durbinWatsonTest(model1)

##  lag Autocorrelation D-W Statistic p-value
##    1    -0.005955922      2.011897   0.162
##  Alternative hypothesis: rho != 0

because the pvalue is greater than .05, this means the errors are independent and the model meets the independence of errors assumption

plot(model1,which=3)

bptest(model1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 4189.2, df = 6, p-value < 2.2e-16

the plot shows a noticable curved pattern which indicates that the variance is not constant across the model; additionaly, the BP test has a pvalue well below .05 meaning homoscedasticity is rejected and violates the assumption

plot(model1,which=2)

ks.test(model1$residuals,"pnorm")

## Warning in ks.test.default(model1$residuals, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model1$residuals
## D = 0.72628, p-value < 2.2e-16
## alternative hypothesis: two-sided

my plot shows a strong S shape which means they deviate from the line. This indicates that the residuals are not normally distributed and the normality assumption is violated. Because the dataset is so large, the sample size exceeds the shapiro test limit and could not be run.

vif(model1)

## BNGDRMDAYS   MJDAY30A   CIG30USE   COCUS30A  METHAM30N  PNRNM30FQ 
##   1.064535   1.093127   1.077673   1.023520   1.034364   1.025844

every VIF value is close to 1, this indicates that the variables are not strongly correlated and the model meets the assumption of no multicolinearity

##mitigation ###homoscedasticity: one way to address this is to transform the dependent variable using a log ##normality of residuals: a solution could be to apply a transformation using log to the dependent variable to make the residuals more normally distrubuted.