library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggplot2)
library(dplyr)

##Data:

pavements<-read.csv("Pavements_3192083553624189959.csv")

##Lets expolore the data:

str(pavements)
## 'data.frame':    97972 obs. of  21 variables:
##  $ OBJECTID                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CartID                   : int  316831 208864 504023 428180 301427 424588 431607 307451 401539 844494 ...
##  $ MSAG_Name                : chr  "PVT RD AT US HWY 281 N" "TAMARON PASS" "ROOSEVELT AVE" "BENTON CITY RD" ...
##  $ FromStreet               : chr  "TBD" "TBD" "GENEVIEVE DR" "TBD" ...
##  $ ToStreet                 : chr  "TBD" "TBD" "WARE BLVD" "TBD" ...
##  $ Jurisdiction             : chr  "San Antonio" "ETJ San Antonio" "San Antonio" "ETJ San Antonio" ...
##  $ District                 : chr  "9" "0" "3" "0" ...
##  $ LengthFeet               : num  168 1159 465 7504 655 ...
##  $ InstallDate              : chr  "1/1/2000 0:00" "1/1/2000 0:00" "1/1/2000 0:00" "1/1/2000 0:00" ...
##  $ PCI                      : num  87 87 87 87 87 ...
##  $ ROW_Type                 : chr  "Street" "Street" "Street" "Street" ...
##  $ Surface_Type             : chr  "TBD" "TBD" "TBD" "TBD" ...
##  $ PavementWidth            : num  28.5 28.5 28.5 28.5 28.5 28.5 52 30 26 28.5 ...
##  $ SpeedLimit               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OneWay                   : chr  "No" "No" "No" "No" ...
##  $ Owner                    : chr  "Private" "Bexar County" "TXDOT" "Bexar County" ...
##  $ RoadFunction             : chr  "Local" "Local" "Principal" "Local" ...
##  $ MaintenanceResponsibility: chr  "Private" "Bexar County" "TXDOT" "Bexar County" ...
##  $ Status                   : chr  "TBD" "TBD" "TBD" "TBD" ...
##  $ Stage                    : chr  "Existing" "Existing" "Existing" "Existing" ...
##  $ Shape__Length            : num  168 1159 465 7504 655 ...
summary(pavements)
##     OBJECTID         CartID         MSAG_Name          FromStreet       
##  Min.   :    1   Min.   : 200591   Length:97972       Length:97972      
##  1st Qu.:24494   1st Qu.: 316441   Class :character   Class :character  
##  Median :48987   Median : 422976   Mode  :character   Mode  :character  
##  Mean   :48987   Mean   :1442679                                        
##  3rd Qu.:73479   3rd Qu.: 518259                                        
##  Max.   :97972   Max.   :8489895                                        
##                                                                         
##    ToStreet         Jurisdiction         District           LengthFeet       
##  Length:97972       Length:97972       Length:97972       Min.   :    1.701  
##  Class :character   Class :character   Class :character   1st Qu.:  264.246  
##  Mode  :character   Mode  :character   Mode  :character   Median :  372.245  
##                                                           Mean   :  576.165  
##                                                           3rd Qu.:  690.861  
##                                                           Max.   :20648.482  
##                                                           NA's   :87         
##  InstallDate             PCI          ROW_Type         Surface_Type      
##  Length:97972       Min.   : 0.00   Length:97972       Length:97972      
##  Class :character   1st Qu.:76.60   Class :character   Class :character  
##  Mode  :character   Median :87.00   Mode  :character   Mode  :character  
##                     Mean   :78.86                                        
##                     3rd Qu.:87.00                                        
##                     Max.   :99.99                                        
##                                                                          
##  PavementWidth     SpeedLimit        OneWay             Owner          
##  Min.   : 1.00   Min.   : 0.000   Length:97972       Length:97972      
##  1st Qu.:28.00   1st Qu.: 0.000   Class :character   Class :character  
##  Median :28.50   Median : 0.000   Mode  :character   Mode  :character  
##  Mean   :28.88   Mean   : 3.725                                        
##  3rd Qu.:28.50   3rd Qu.: 0.000                                        
##  Max.   :96.00   Max.   :60.000                                        
##                                                                        
##  RoadFunction       MaintenanceResponsibility    Status         
##  Length:97972       Length:97972              Length:97972      
##  Class :character   Class :character          Class :character  
##  Mode  :character   Mode  :character          Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##     Stage           Shape__Length      
##  Length:97972       Min.   :    1.701  
##  Class :character   1st Qu.:  265.014  
##  Mode  :character   Median :  373.017  
##                     Mean   :  577.332  
##                     3rd Qu.:  692.186  
##                     Max.   :20648.482  
## 

Pavement Condition Index (PCI) is a measurement of the rideability of a street. This measurement is obtained by a vehicle equipped with tools driving the streets and collecting data. The index spans from 0, extremely poor conditions, to 100, newly constructed roadway. Municipalities use this range to grade the quality of their infrastructure. The City of San Antonio is striving for an average PCI of 70.

##Lets clean the data:

Need to remove the “TBD” from the maintenance responsibility.

pavements_MR_clean<-pavements%>% filter(MaintenanceResponsibility!="TBD")

Need to remove the “TBD” from the surface type.

pavements_ST_clean<-pavements_MR_clean%>%filter(Surface_Type!="TBD")

Need to remove the “TBD” from the road function.

pavements_RF_clean<-pavements_ST_clean%>%filter(RoadFunction!="TBD")

Need to remove the 0 from the speed limit.

pavements_clean<-pavements_RF_clean%>%filter(SpeedLimit>0)

##Lets create the model

pavements_model<-lm(PCI~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean)

##Lets check the model to the assumptions

  1. Linearity (plot and raintest)
plot(pavements_model,which=1)

raintest(pavements_model)
## 
##  Rainbow test
## 
## data:  pavements_model
## Rain = 0.95153, df1 = 5771, df2 = 5748, p-value = 0.9703

Plot: The fitted (red) line is nearly straight, but there is an abnormality at around the 72 mark. Large dataset do not need to be completetly linear. the model is mostly linear.
Rainbow Test: The p-value for the rainbow test is 0.9703. This value is high which means the data is linear.

  1. Independence of errors (durbin-watson)
durbinWatsonTest(pavements_model)
##  lag Autocorrelation D-W Statistic p-value
##    1   -0.0006985343      2.001104   0.966
##  Alternative hypothesis: rho != 0

Durbin-Watson Test: The p-value is 0.996 which is greater than 0.05. This means the errors are independent. Also, the D-W is 2.001104 which is really good since it over 2. This passes the Independence of Errors assumption.

  1. Homoscedasticity (plot, bptest)
plot(pavements_model, which=3)
## Warning: not plotting observations with leverage one:
##   6544, 8138, 9696, 9867

bptest(pavements_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  pavements_model
## BP = 175.66, df = 21, p-value < 2.2e-16

Plot: The red line has a suddden drop and the points are scattered unevenly so the variance of the model is not the same across the different levels of the model. This violates the Homoscedasticity assumption. Breusch-Pagan Test: The p-value is 0.00000000000000022. This is a lot lower than 0.05 meaning the model is heteroscedastic. This supports the violation of the Homoscedasticity assumption.

  1. Normality of residuals (QQ plot, Kolmogorov-Smirnov test)
plot(pavements_model, which=2)
## Warning: not plotting observations with leverage one:
##   6544, 8138, 9696, 9867

ks.test(pavements_clean$PCI,"pnorm")
## Warning in ks.test.default(pavements_clean$PCI, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  pavements_clean$PCI
## D = 0.99451, p-value < 2.2e-16
## alternative hypothesis: two-sided

Plot: About half of the observations fall on the dotted line. This means the residuals are not normally distributed. This violates the Normality of Residuals assumption. Kolmogorov-Smirnov test: This test results in p-value of 2.2e-16 so the variable is not normally distributed.

  1. No multicolinarity (VIF, cor)
cor(pavements_clean$PCI,pavements_clean$SpeedLimit)
## [1] 0.01662207
cor.test(pavements_clean$PCI,pavements_clean$SpeedLimit,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pavements_clean$PCI and pavements_clean$SpeedLimit
## t = 1.7859, df = 11540, p-value = 0.07415
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.001622245  0.034855320
## sample estimates:
##        cor 
## 0.01662207

The correlation between PCI and speed limit is 1.6% which is very low so the variables are not correlated and passes the multicolinarity assumptions. Running the Pearson test, the p-value equals 0.07415 which is higher than 0.05 so the correlation would have been by chance, hence the low correlation of the variables.

##Which assumption are volated

The model violates the assumptions of homoscedasticity and normality of residuals.

##How are the assumptions to be mitigated

  1. Hetroscedascity: square root the dependent variable
pavements_model_sqrt<-lm(sqrt(PCI)~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean)
plot(pavements_model_sqrt, which=3)
## Warning: not plotting observations with leverage one:
##   6544, 8138, 9696, 9867

bptest(pavements_model_sqrt)
## 
##  studentized Breusch-Pagan test
## 
## data:  pavements_model_sqrt
## BP = 210.75, df = 21, p-value < 2.2e-16

Plot: The red line has a sudden abnormality so the variance of the model is not the same across the different levels of the model. Homoscedasticity is not achievable. Breusch-Pagan Test: The p-value is 2.2e-16. This is a lot lower than 0.05 meaning the model is heteroscedastic. Homoscedasticity is not achievable.

  1. Residuals not normal? square root the dependent variable
plot(pavements_model_sqrt, which=2)
## Warning: not plotting observations with leverage one:
##   6544, 8138, 9696, 9867

Plot: About a quarter of the observations fall on the dotted line which is worse then the original model. Since the normality of residuals is not achievable, a generalized linear model should be used.

#Generalized Linear Model

pavements_glm<-glm(sqrt(PCI)~MaintenanceResponsibility+Surface_Type+RoadFunction+SpeedLimit,data=pavements_clean,family = gaussian)

#use the table to identify the baseline/missing category 
table(pavements_clean$MaintenanceResponsibility)
## 
##              Balcones Heights                  Bexar County 
##                             1                            13 
##                      Converse      COSA - Public Works Dept 
##                             2                         11385 
##                Ft Sam Houston                         Kirby 
##                             3                             1 
## Port Authority of San Antonio                       Private 
##                            35                            15 
##     Private - Gated Community               Property Owners 
##                             3                             4 
##                       Schertz                         TXDOT 
##                             1                            75 
##                     Windcrest 
##                             4
table(pavements_clean$Surface_Type)
## 
##          AAC Overlay  AC Asphalt Concrete      Brick and Block 
##                    1                11464                    9 
##                Earth               Gravel PCC Jointed Concrete 
##                    2                    2                   64
table(pavements_clean$RoadFunction)
## 
##     Alley Collector     Local     Minor Principal 
##         4       974      8084      1829       651
summary(pavements_glm)
## 
## Call:
## glm(formula = sqrt(PCI) ~ MaintenanceResponsibility + Surface_Type + 
##     RoadFunction + SpeedLimit, family = gaussian, data = pavements_clean)
## 
## Coefficients: (1 not defined because of singularities)
##                                                         Estimate Std. Error
## (Intercept)                                             0.239769   2.086021
## MaintenanceResponsibilityBexar County                   8.290192   1.522803
## MaintenanceResponsibilityConverse                       4.984095   1.797041
## MaintenanceResponsibilityCOSA - Public Works Dept       8.626426   1.467388
## MaintenanceResponsibilityFt Sam Houston                 8.267892   1.694080
## MaintenanceResponsibilityKirby                          7.409785   2.075006
## MaintenanceResponsibilityPort Authority of San Antonio  1.950740   1.488314
## MaintenanceResponsibilityPrivate                        6.939247   1.515770
## MaintenanceResponsibilityPrivate - Gated Community      6.093449   1.694548
## MaintenanceResponsibilityProperty Owners                8.916565   1.798099
## MaintenanceResponsibilitySchertz                        8.365416   2.074713
## MaintenanceResponsibilityTXDOT                          7.776121   1.477135
## MaintenanceResponsibilityWindcrest                      0.175497   1.640156
## Surface_TypeAC Asphalt Concrete                        -0.934948   1.466704
## Surface_TypeBrick and Block                             0.073299   1.547058
## Surface_TypeEarth                                      -0.164661   2.074168
## Surface_TypeGravel                                      0.055951   1.796465
## Surface_TypePCC Jointed Concrete                       -1.066417   1.478277
## RoadFunctionCollector                                   0.359474   0.082036
## RoadFunctionLocal                                       0.125478   0.072525
## RoadFunctionMinor                                       0.189855   0.068987
## RoadFunctionPrincipal                                         NA         NA
## SpeedLimit                                              0.011190   0.005857
##                                                        t value Pr(>|t|)    
## (Intercept)                                              0.115 0.908494    
## MaintenanceResponsibilityBexar County                    5.444 5.32e-08 ***
## MaintenanceResponsibilityConverse                        2.774 0.005555 ** 
## MaintenanceResponsibilityCOSA - Public Works Dept        5.879 4.25e-09 ***
## MaintenanceResponsibilityFt Sam Houston                  4.880 1.07e-06 ***
## MaintenanceResponsibilityKirby                           3.571 0.000357 ***
## MaintenanceResponsibilityPort Authority of San Antonio   1.311 0.189984    
## MaintenanceResponsibilityPrivate                         4.578 4.74e-06 ***
## MaintenanceResponsibilityPrivate - Gated Community       3.596 0.000325 ***
## MaintenanceResponsibilityProperty Owners                 4.959 7.19e-07 ***
## MaintenanceResponsibilitySchertz                         4.032 5.56e-05 ***
## MaintenanceResponsibilityTXDOT                           5.264 1.43e-07 ***
## MaintenanceResponsibilityWindcrest                       0.107 0.914791    
## Surface_TypeAC Asphalt Concrete                         -0.637 0.523846    
## Surface_TypeBrick and Block                              0.047 0.962211    
## Surface_TypeEarth                                       -0.079 0.936727    
## Surface_TypeGravel                                       0.031 0.975154    
## Surface_TypePCC Jointed Concrete                        -0.721 0.470683    
## RoadFunctionCollector                                    4.382 1.19e-05 ***
## RoadFunctionLocal                                        1.730 0.083634 .  
## RoadFunctionMinor                                        2.752 0.005932 ** 
## RoadFunctionPrincipal                                       NA       NA    
## SpeedLimit                                               1.911 0.056078 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.150953)
## 
##     Null deviance: 26927  on 11541  degrees of freedom
## Residual deviance: 24779  on 11520  degrees of freedom
## AIC: 41619
## 
## Number of Fisher Scoring iterations: 2

estimated rough p-value: 26927-24779=2148 11541-11520=21 21>2148 so the model is significant

Intercept: Balcones Heights, ACC Overlay, Alley, No speed limit

GLM shows the independent variables DO NOT influence the dependent variable

could this be because balcones heights doesn’t have ACC overlay. ACC overlay is only in cosa responsibility since it show the street received an rehabilitation

#H1: The City of San Antonio’s maintenance responsibility results in higher PCI.
Baseline: Balcones Heights Estimate Std. Error t value Pr(>|t|) MaintenanceResponsibilityCOSA - Public Works Dept 8.626426 1.467388 5.879 4.25e-09 ***
COSA MR is significantly better than BH on PCI H1=true

#H2: Roadways constructed of concrete pavement result in higher PCI. Baseline: ACC Overlay Estimate Std. Error t value Pr(>|t|) Surface_TypePCC Jointed Concrete -1.066417 1.478277 -0.721 0.470683
ST is not significant on PCI.
H2=false

#H3: Roadways categorized as collector streets result in lower PCI. Baseline: Alley Estimate Std. Error t value Pr(>|t|) RoadFunctionCollector 0.359474 0.082036 4.382 1.19e-05 ***
Collector RF is significantly better than Alley on PCI. Principal function is correlated with another variable. H3=true

#H4: Roadways with higher speed limits result in lower PCI. . Baseline: No speed limit Estimate Std. Error t value Pr(>|t|) SpeedLimit 0.011190 0.005857 1.911 0.056078 .
Speed limit is insignificant on PCI H4=false