library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2)
library(dplyr)
##Data:
pavements<-read.csv("Pavements_3192083553624189959.csv")
##Lets expolore the data:
str(pavements)
## 'data.frame': 97972 obs. of 21 variables:
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CartID : int 316831 208864 504023 428180 301427 424588 431607 307451 401539 844494 ...
## $ MSAG_Name : chr "PVT RD AT US HWY 281 N" "TAMARON PASS" "ROOSEVELT AVE" "BENTON CITY RD" ...
## $ FromStreet : chr "TBD" "TBD" "GENEVIEVE DR" "TBD" ...
## $ ToStreet : chr "TBD" "TBD" "WARE BLVD" "TBD" ...
## $ Jurisdiction : chr "San Antonio" "ETJ San Antonio" "San Antonio" "ETJ San Antonio" ...
## $ District : chr "9" "0" "3" "0" ...
## $ LengthFeet : num 168 1159 465 7504 655 ...
## $ InstallDate : chr "1/1/2000 0:00" "1/1/2000 0:00" "1/1/2000 0:00" "1/1/2000 0:00" ...
## $ PCI : num 87 87 87 87 87 ...
## $ ROW_Type : chr "Street" "Street" "Street" "Street" ...
## $ Surface_Type : chr "TBD" "TBD" "TBD" "TBD" ...
## $ PavementWidth : num 28.5 28.5 28.5 28.5 28.5 28.5 52 30 26 28.5 ...
## $ SpeedLimit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OneWay : chr "No" "No" "No" "No" ...
## $ Owner : chr "Private" "Bexar County" "TXDOT" "Bexar County" ...
## $ RoadFunction : chr "Local" "Local" "Principal" "Local" ...
## $ MaintenanceResponsibility: chr "Private" "Bexar County" "TXDOT" "Bexar County" ...
## $ Status : chr "TBD" "TBD" "TBD" "TBD" ...
## $ Stage : chr "Existing" "Existing" "Existing" "Existing" ...
## $ Shape__Length : num 168 1159 465 7504 655 ...
summary(pavements)
## OBJECTID CartID MSAG_Name FromStreet
## Min. : 1 Min. : 200591 Length:97972 Length:97972
## 1st Qu.:24494 1st Qu.: 316441 Class :character Class :character
## Median :48987 Median : 422976 Mode :character Mode :character
## Mean :48987 Mean :1442679
## 3rd Qu.:73479 3rd Qu.: 518259
## Max. :97972 Max. :8489895
##
## ToStreet Jurisdiction District LengthFeet
## Length:97972 Length:97972 Length:97972 Min. : 1.701
## Class :character Class :character Class :character 1st Qu.: 264.246
## Mode :character Mode :character Mode :character Median : 372.245
## Mean : 576.165
## 3rd Qu.: 690.861
## Max. :20648.482
## NA's :87
## InstallDate PCI ROW_Type Surface_Type
## Length:97972 Min. : 0.00 Length:97972 Length:97972
## Class :character 1st Qu.:76.60 Class :character Class :character
## Mode :character Median :87.00 Mode :character Mode :character
## Mean :78.86
## 3rd Qu.:87.00
## Max. :99.99
##
## PavementWidth SpeedLimit OneWay Owner
## Min. : 1.00 Min. : 0.000 Length:97972 Length:97972
## 1st Qu.:28.00 1st Qu.: 0.000 Class :character Class :character
## Median :28.50 Median : 0.000 Mode :character Mode :character
## Mean :28.88 Mean : 3.725
## 3rd Qu.:28.50 3rd Qu.: 0.000
## Max. :96.00 Max. :60.000
##
## RoadFunction MaintenanceResponsibility Status
## Length:97972 Length:97972 Length:97972
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Stage Shape__Length
## Length:97972 Min. : 1.701
## Class :character 1st Qu.: 265.014
## Mode :character Median : 373.017
## Mean : 577.332
## 3rd Qu.: 692.186
## Max. :20648.482
##
Pavement Condition Index (PCI) is a measurement of the rideability of a street. This measurement is obtained by a vehicle equipped with tools driving the streets and collecting data. The index spans from 0, extremely poor conditions, to 100, newly constructed roadway. Municipalities use this range to grade the quality of their infrastructure. The City of San Antonio is striving for an average PCI of 70.
##Lets clean the data:
Need to remove the “TBD” from the maintenance responsibility.
pavements_MR_clean<-pavements%>% filter(MaintenanceResponsibility!="TBD")
Need to remove the “TBD” from the surface type.
pavements_ST_clean<-pavements_MR_clean%>%filter(Surface_Type!="TBD")
Need to remove the “TBD” from the road function.
pavements_RF_clean<-pavements_ST_clean%>%filter(RoadFunction!="TBD")
Need to remove the 0 from the speed limit.
pavements_clean<-pavements_RF_clean%>%filter(SpeedLimit>0)
##Lets create the model
pavements_model<-lm(PCI~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean)
##Lets check the model to the assumptions
plot(pavements_model,which=1)
raintest(pavements_model)
##
## Rainbow test
##
## data: pavements_model
## Rain = 0.95153, df1 = 5771, df2 = 5748, p-value = 0.9703
Plot: The fitted (red) line is nearly straight, but there is an
abnormality at around the 72 mark. Large dataset do not need to be
completetly linear. the model is mostly linear.
Rainbow Test: The p-value for the rainbow test is 0.9703. This value is
high which means the data is linear.
durbinWatsonTest(pavements_model)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.0006985343 2.001104 0.966
## Alternative hypothesis: rho != 0
Durbin-Watson Test: The p-value is 0.996 which is greater than 0.05. This means the errors are independent. Also, the D-W is 2.001104 which is really good since it over 2. This passes the Independence of Errors assumption.
plot(pavements_model, which=3)
## Warning: not plotting observations with leverage one:
## 6544, 8138, 9696, 9867
bptest(pavements_model)
##
## studentized Breusch-Pagan test
##
## data: pavements_model
## BP = 175.66, df = 21, p-value < 2.2e-16
Plot: The red line has a suddden drop and the points are scattered unevenly so the variance of the model is not the same across the different levels of the model. This violates the Homoscedasticity assumption. Breusch-Pagan Test: The p-value is 0.00000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the Homoscedasticity assumption.
plot(pavements_model, which=2)
## Warning: not plotting observations with leverage one:
## 6544, 8138, 9696, 9867
ks.test(pavements_clean$PCI,"pnorm")
## Warning in ks.test.default(pavements_clean$PCI, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: pavements_clean$PCI
## D = 0.99451, p-value < 2.2e-16
## alternative hypothesis: two-sided
Plot: About half of the observations fall on the dotted line. This means the residuals are not normally distributed. This violates the Normality of Residuals assumption. Kolmogorov-Smirnov test: This test results in p-value of 2.2e-16 so the variable is not normally distributed.
cor(pavements_clean$PCI,pavements_clean$SpeedLimit)
## [1] 0.01662207
cor.test(pavements_clean$PCI,pavements_clean$SpeedLimit,method="pearson")
##
## Pearson's product-moment correlation
##
## data: pavements_clean$PCI and pavements_clean$SpeedLimit
## t = 1.7859, df = 11540, p-value = 0.07415
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.001622245 0.034855320
## sample estimates:
## cor
## 0.01662207
The correlation between PCI and speed limit is 1.6% which is very low so the variables are not correlated and passes the multicolinarity assumptions. Running the Pearson test, the p-value equals 0.07415 which is higher than 0.05 so the correlation would have been by chance, hence the low correlation of the variables.
##Which assumption are volated
The model violates the assumptions of homoscedasticity and normality of residuals.
##How are the assumptions to be mitigated
pavements_model_sqrt<-lm(sqrt(PCI)~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean)
plot(pavements_model_sqrt, which=3)
## Warning: not plotting observations with leverage one:
## 6544, 8138, 9696, 9867
bptest(pavements_model_sqrt)
##
## studentized Breusch-Pagan test
##
## data: pavements_model_sqrt
## BP = 210.75, df = 21, p-value < 2.2e-16
Plot: The red line has a sudden abnormality so the variance of the model is not the same across the different levels of the model. Homoscedasticity is not achievable. Breusch-Pagan Test: The p-value is 2.2e-16. This is a lot lower than 0.05 meaning the model is heteroscedastic. Homoscedasticity is not achievable.
plot(pavements_model_sqrt, which=2)
## Warning: not plotting observations with leverage one:
## 6544, 8138, 9696, 9867
Plot: About a quarter of the observations fall on the dotted line which
is worse then the original model. Since the normality of residuals is
not achievable, a generalized linear model should be used.
#Generalized Linear Model
pavements_glm<-glm(sqrt(PCI)~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean,family = gaussian)
#use the table to identify the baseline/missing category
table(pavements_clean$MaintenanceResponsibility)
##
## Balcones Heights Bexar County
## 1 13
## Converse COSA - Public Works Dept
## 2 11385
## Ft Sam Houston Kirby
## 3 1
## Port Authority of San Antonio Private
## 35 15
## Private - Gated Community Property Owners
## 3 4
## Schertz TXDOT
## 1 75
## Windcrest
## 4
table(pavements_clean$Surface_Type)
##
## AAC Overlay AC Asphalt Concrete Brick and Block
## 1 11464 9
## Earth Gravel PCC Jointed Concrete
## 2 2 64
table(pavements_clean$RoadFunction)
##
## Alley Collector Local Minor Principal
## 4 974 8084 1829 651
summary(pavements_glm)
##
## Call:
## glm(formula = sqrt(PCI) ~ MaintenanceResponsibility + Surface_Type +
## RoadFunction + SpeedLimit, family = gaussian, data = pavements_clean)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 0.239769 2.086021
## MaintenanceResponsibilityBexar County 8.290192 1.522803
## MaintenanceResponsibilityConverse 4.984095 1.797041
## MaintenanceResponsibilityCOSA - Public Works Dept 8.626426 1.467388
## MaintenanceResponsibilityFt Sam Houston 8.267892 1.694080
## MaintenanceResponsibilityKirby 7.409785 2.075006
## MaintenanceResponsibilityPort Authority of San Antonio 1.950740 1.488314
## MaintenanceResponsibilityPrivate 6.939247 1.515770
## MaintenanceResponsibilityPrivate - Gated Community 6.093449 1.694548
## MaintenanceResponsibilityProperty Owners 8.916565 1.798099
## MaintenanceResponsibilitySchertz 8.365416 2.074713
## MaintenanceResponsibilityTXDOT 7.776121 1.477135
## MaintenanceResponsibilityWindcrest 0.175497 1.640156
## Surface_TypeAC Asphalt Concrete -0.934948 1.466704
## Surface_TypeBrick and Block 0.073299 1.547058
## Surface_TypeEarth -0.164661 2.074168
## Surface_TypeGravel 0.055951 1.796465
## Surface_TypePCC Jointed Concrete -1.066417 1.478277
## RoadFunctionCollector 0.359474 0.082036
## RoadFunctionLocal 0.125478 0.072525
## RoadFunctionMinor 0.189855 0.068987
## RoadFunctionPrincipal NA NA
## SpeedLimit 0.011190 0.005857
## t value Pr(>|t|)
## (Intercept) 0.115 0.908494
## MaintenanceResponsibilityBexar County 5.444 5.32e-08 ***
## MaintenanceResponsibilityConverse 2.774 0.005555 **
## MaintenanceResponsibilityCOSA - Public Works Dept 5.879 4.25e-09 ***
## MaintenanceResponsibilityFt Sam Houston 4.880 1.07e-06 ***
## MaintenanceResponsibilityKirby 3.571 0.000357 ***
## MaintenanceResponsibilityPort Authority of San Antonio 1.311 0.189984
## MaintenanceResponsibilityPrivate 4.578 4.74e-06 ***
## MaintenanceResponsibilityPrivate - Gated Community 3.596 0.000325 ***
## MaintenanceResponsibilityProperty Owners 4.959 7.19e-07 ***
## MaintenanceResponsibilitySchertz 4.032 5.56e-05 ***
## MaintenanceResponsibilityTXDOT 5.264 1.43e-07 ***
## MaintenanceResponsibilityWindcrest 0.107 0.914791
## Surface_TypeAC Asphalt Concrete -0.637 0.523846
## Surface_TypeBrick and Block 0.047 0.962211
## Surface_TypeEarth -0.079 0.936727
## Surface_TypeGravel 0.031 0.975154
## Surface_TypePCC Jointed Concrete -0.721 0.470683
## RoadFunctionCollector 4.382 1.19e-05 ***
## RoadFunctionLocal 1.730 0.083634 .
## RoadFunctionMinor 2.752 0.005932 **
## RoadFunctionPrincipal NA NA
## SpeedLimit 1.911 0.056078 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.150953)
##
## Null deviance: 26927 on 11541 degrees of freedom
## Residual deviance: 24779 on 11520 degrees of freedom
## AIC: 41619
##
## Number of Fisher Scoring iterations: 2
estimated rough p-value: 26927-24779=2148 11541-11520=21 21>2148 so the model is significant
Intercept: Balcones Heights, ACC Overlay, Alley, No speed limit
GLM shows the independent variables DO NOT influence the dependent variable
could this be because balcones heights doesn’t have ACC overlay. ACC overlay is only in cosa responsibility since it show the street received an rehabilitation
#H1: The City of San Antonio’s maintenance responsibility results in
higher PCI.
Baseline: Balcones Heights Estimate Std. Error t value Pr(>|t|)
MaintenanceResponsibilityCOSA - Public Works Dept 8.626426 1.467388
5.879 4.25e-09 ***
COSA MR is significantly better than BH on PCI H1=true
#H2: Roadways constructed of concrete pavement result in higher PCI.
Baseline: ACC Overlay Estimate Std. Error t value Pr(>|t|)
Surface_TypePCC Jointed Concrete -1.066417 1.478277 -0.721
0.470683
ST is not significant on PCI.
H2=false
#H3: Roadways categorized as collector streets result in lower PCI.
Baseline: Alley Estimate Std. Error t value Pr(>|t|)
RoadFunctionCollector 0.359474 0.082036 4.382 1.19e-05 ***
Collector RF is significantly better than Alley on PCI. Principal
function is correlated with another variable. H3=true
#H4: Roadways with higher speed limits result in lower PCI. .
Baseline: No speed limit Estimate Std. Error t value Pr(>|t|)
SpeedLimit 0.011190 0.005857 1.911 0.056078 .
Speed limit is insignificant on PCI H4=false