library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
pavements<-read.csv("pavements_3192083553624189959.csv")
pavements_model<-lm(PCI~RoadFunction+OneWay+MaintenanceResponsibility,data=pavements)
plot(pavements_model,which=1)
raintest(pavements_model)
##
## Rainbow test
##
## data: pavements_model
## Rain = 0.99023, df1 = 48986, df2 = 48926, p-value = 0.8613
Plot: The fitted (red) line is not straight and the points are not scattered evenly around the line thus the model is not linear. This violates the assumption of Linearity. Rainbow Test: The p-value for the rainbow test is 0.8613. This value is high which means the data is linear. However, the test assumes homoscedasticity. As demonstrated later in the assignment, the model is not homoscedastic.
durbinWatsonTest(pavements_model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.003009701 1.993979 0.36
## Alternative hypothesis: rho != 0
Durbin-Watson Test: The p-value is 0.314 which is greater than 0.05. This means the errors are independent. Also, the D-W is 1.993979 which is really good since it nearly 2. This passes the Independence of Errors assumption.
plot(pavements_model, which=3)
## Warning: not plotting observations with leverage one:
## 70996
bptest(pavements_model)
##
## studentized Breusch-Pagan test
##
## data: pavements_model
## BP = 6144.4, df = 59, p-value < 2.2e-16
Plot: The red line is curved and the points are scattered unevenly so the variance of the model is not the same across the different levels of the model. This violates the Homoscedasticity assumption. Breusch-Pagan Test: The p-value is 0.00000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the Homoscedasticity assumption.
plot(pavements_model, which=2)
## Warning: not plotting observations with leverage one:
## 70996
Plot: Only a small amount of the observations fall on the dotted line.
This means the residuals are not normally distributed. This violates the
Normality of Residuals assumption. Shapiro-Wilk Normality Test: This
test cannot be ran since there are too many observations.
vif(pavements_model)
## GVIF Df GVIF^(1/(2*Df))
## RoadFunction 17.075503 12 1.125509
## OneWay 1.040743 1 1.020168
## MaintenanceResponsibility 16.873965 46 1.031191
GVIF Test: Generalized Variance Inflation Factor test resulted in a GVIF of 17.07 and 16.87 for Road Function and Maintenance Responsibility, respectively. This means that both variables are strongly correlated with another variable, Oneway. This violates the No Multicolinarity assumption.
If your model violates an assumption, which one? The model violates linearity, homoscedasticity, normality of residuals, and no multicolinarity.
What would you do to mitigate this assumption? Show your work.
No Linearity? (try a log transformation) (may not always work)
pavements_modified<-pavements%>%na.omit(.)%>%filter(PCI>0)
pavements_log<-lm(log(PCI)~OneWay+RoadFunction+MaintenanceResponsibility,data=pavements_modified)
plot(pavements_log,which=1)
raintest(pavements_log)
##
## Rainbow test
##
## data: pavements_log
## Rain = 1.0169, df1 = 48267, df2 = 48207, p-value = 0.03278
I log my dependent variable and now the model is linear since the red line is nearly flat and the p-value of the rainbow test is 0.03278 which is significant since its less than 0.05. I need to confirm homoscedasticity.
Heteroscedascity? (log transformation of the dependent variable) (also may not always work)
pavements_model_log<-lm(log(PCI)~RoadFunction+OneWay+MaintenanceResponsibility,data=pavements_modified)
plot(pavements_model_log, which=3)
## Warning: not plotting observations with leverage one:
## 46043
bptest(pavements_model_log)
##
## studentized Breusch-Pagan test
##
## data: pavements_model_log
## BP = 3110.7, df = 59, p-value < 2.2e-16
I log the dependent variable. The red line is straight, but the points are still unevenly scattered so the model violates the homoscedasticity assumption. The p-value is 0.000000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the homoscedasticity assumption.
Residuals not normal? (log transformation)
plot(pavements_model_log,which=2)
## Warning: not plotting observations with leverage one:
## 46043
I log the dependent variable and still only a small amount of the
observations fall on the dotted line. This violates the Normality of
Residuals assumption.
Multicolinearity? (remove strongly correlated variables)
pavements_model_2<-lm(PCI~RoadFunction+OneWay,data=pavements)
vif(pavements_model_2)
## GVIF Df GVIF^(1/(2*Df))
## RoadFunction 1.025782 12 1.001061
## OneWay 1.025782 1 1.012809
I removed the oneway and got some large GVIF (16.?) so then I removed maintenance responsibility and got a GVIF of 1 which is less than 10 so the variables are not correlated and since its less than 5 it is not problematic.