library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
  1. Load your preferred dataset into R studio
pavements<-read.csv("pavements_3192083553624189959.csv")
  1. Create a linear model “lm()” from the variables, with a continuous dependent variable as the outcome
pavements_model<-lm(PCI~RoadFunction+OneWay+MaintenanceResponsibility,data=pavements)
  1. Check the following assumptions:
  2. Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
  1. Linearity (plot and raintest)
plot(pavements_model,which=1)

raintest(pavements_model)
## 
##  Rainbow test
## 
## data:  pavements_model
## Rain = 0.99023, df1 = 48986, df2 = 48926, p-value = 0.8613

Plot: The fitted (red) line is not straight and the points are not scattered evenly around the line thus the model is not linear. This violates the assumption of Linearity. Rainbow Test: The p-value for the rainbow test is 0.8613. This value is high which means the data is linear. However, the test assumes homoscedasticity. As demonstrated later in the assignment, the model is not homoscedastic.

  1. Independence of errors (durbin-watson)
durbinWatsonTest(pavements_model)
##  lag Autocorrelation D-W Statistic p-value
##    1     0.003009701      1.993979    0.36
##  Alternative hypothesis: rho != 0

Durbin-Watson Test: The p-value is 0.314 which is greater than 0.05. This means the errors are independent. Also, the D-W is 1.993979 which is really good since it nearly 2. This passes the Independence of Errors assumption.

  1. Homoscedasticity (plot, bptest)
plot(pavements_model, which=3)
## Warning: not plotting observations with leverage one:
##   70996

bptest(pavements_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  pavements_model
## BP = 6144.4, df = 59, p-value < 2.2e-16

Plot: The red line is curved and the points are scattered unevenly so the variance of the model is not the same across the different levels of the model. This violates the Homoscedasticity assumption. Breusch-Pagan Test: The p-value is 0.00000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the Homoscedasticity assumption.

  1. Normality of residuals (QQ plot, shapiro test)
plot(pavements_model, which=2)
## Warning: not plotting observations with leverage one:
##   70996

Plot: Only a small amount of the observations fall on the dotted line. This means the residuals are not normally distributed. This violates the Normality of Residuals assumption. Shapiro-Wilk Normality Test: This test cannot be ran since there are too many observations.

  1. No multicolinarity (VIF, cor)
vif(pavements_model)
##                                GVIF Df GVIF^(1/(2*Df))
## RoadFunction              17.075503 12        1.125509
## OneWay                     1.040743  1        1.020168
## MaintenanceResponsibility 16.873965 46        1.031191

GVIF Test: Generalized Variance Inflation Factor test resulted in a GVIF of 17.07 and 16.87 for Road Function and Maintenance Responsibility, respectively. This means that both variables are strongly correlated with another variable, Oneway. This violates the No Multicolinarity assumption.

  1. If your model violates an assumption, which one? The model violates linearity, homoscedasticity, normality of residuals, and no multicolinarity.

  2. What would you do to mitigate this assumption? Show your work.

No Linearity? (try a log transformation) (may not always work)

pavements_modified<-pavements%>%na.omit(.)%>%filter(PCI>0)
pavements_log<-lm(log(PCI)~OneWay+RoadFunction+MaintenanceResponsibility,data=pavements_modified)
plot(pavements_log,which=1)

raintest(pavements_log)
## 
##  Rainbow test
## 
## data:  pavements_log
## Rain = 1.0169, df1 = 48267, df2 = 48207, p-value = 0.03278

I log my dependent variable and now the model is linear since the red line is nearly flat and the p-value of the rainbow test is 0.03278 which is significant since its less than 0.05. I need to confirm homoscedasticity.

Heteroscedascity? (log transformation of the dependent variable) (also may not always work)

pavements_model_log<-lm(log(PCI)~RoadFunction+OneWay+MaintenanceResponsibility,data=pavements_modified)
plot(pavements_model_log, which=3)
## Warning: not plotting observations with leverage one:
##   46043

bptest(pavements_model_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  pavements_model_log
## BP = 3110.7, df = 59, p-value < 2.2e-16

I log the dependent variable. The red line is straight, but the points are still unevenly scattered so the model violates the homoscedasticity assumption. The p-value is 0.000000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the homoscedasticity assumption.

Residuals not normal? (log transformation)

plot(pavements_model_log,which=2)
## Warning: not plotting observations with leverage one:
##   46043

I log the dependent variable and still only a small amount of the observations fall on the dotted line. This violates the Normality of Residuals assumption.

Multicolinearity? (remove strongly correlated variables)

pavements_model_2<-lm(PCI~RoadFunction+OneWay,data=pavements)
vif(pavements_model_2)
##                  GVIF Df GVIF^(1/(2*Df))
## RoadFunction 1.025782 12        1.001061
## OneWay       1.025782  1        1.012809

I removed the oneway and got some large GVIF (16.?) so then I removed maintenance responsibility and got a GVIF of 1 which is less than 10 so the variables are not correlated and since its less than 5 it is not problematic.