library (tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

During the last homework, I found issues with my dataset that limited me to choosing variables I was not as interested in. This new dataset, also produced from ArcGIS, should have all streets, their pavement condition scores, and a column assigning a numerical ID associated with the action (or lack thereof) the city responded with. 1 = no project 2 = preservation project 3 = rehabilitation project 4 = preservation AND rehabilitation projects

Projects=read_excel("MergeTestOct21.xlsx")
ProjectsFiltered <- na.omit(Projects)

There was a large error message, but it’s clear that that is because R expected the Fiscal Year column to contain numerical values, when all of the non-project streets contain “none” in that space. I originally accounted for that in the chunk, but later went back and replaced all of those “none” entries with the number 1, since the year is really only going to be used as a label.

cor(Projects$`Category ID`,Projects$WeightedPCI, use = "complete.obs")
## [1] 0.1271079

I checked the data, and all values on the selected columns are valid. I will do some other exercises with it to hopefully understand what is happening.

cor.test(Projects$`Category ID`,Projects$WeightedPCI, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  Projects$`Category ID` and Projects$WeightedPCI
## t = 7.8589, df = 3761, p-value = 5.021e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09554333 0.15841710
## sample estimates:
##       cor 
## 0.1271079
pairs(~ProjectsFiltered$`Category ID`+ProjectsFiltered$WeightedPCI+ProjectsFiltered$FiscalYear)

hist(ProjectsFiltered$WeightedPCI,breaks=50,probability = T)
lines(density(ProjectsFiltered$WeightedPCI),col="red",lwd=3)

After seeing a disproportionately-high number of streets listed with an almost zero PCI score, I finally checked even further to the original data in ArcGIS and confirmed that there were, in fact, a few streets in the original data that had no PCI value. I returned the data to it’s original form and elected to not count the NA’s.

cor(ProjectsFiltered$`Category ID`,ProjectsFiltered$WeightedPCI, method = "pearson", use = "complete.obs")
## [1] 0.1269629

After some additional errors, I checked a third time and found that there actually were some rows that were left blank instead of being assigned a category ID, so I fixed the issue and re-ran all of the tests.

shapiro.test(ProjectsFiltered$WeightedPCI)
## 
##  Shapiro-Wilk normality test
## 
## data:  ProjectsFiltered$WeightedPCI
## W = 0.9508, p-value < 2.2e-16

Things started making more sense with the plot below. I could at least visually tell that there is normality in the data, even if the tails diverge.

qqnorm(ProjectsFiltered$WeightedPCI, main = "Q-Q Plot of WeightedPCI")
qqline(ProjectsFiltered$WeightedPCI, col = "red", lwd = 2)

SqProjectsFiltered=sqrt(ProjectsFiltered$WeightedPCI)

I tried square rooting the data, but the resulting plot seemed like a step back.

qqnorm(SqProjectsFiltered, main = "Q-Q Plot of WeightedPCI")
qqline(SqProjectsFiltered, col = "red", lwd = 2)

LogWeightedPCI <- log(ProjectsFiltered$WeightedPCI + .01)

I tried to log the data, which also produced strange looking results I do not feel prepared to interpret at this time.

qqnorm(LogWeightedPCI, main = "Q-Q Plot of WeightedPCI")
qqline(LogWeightedPCI, col = "red", lwd = 2)

shapiro.test(LogWeightedPCI)
## 
##  Shapiro-Wilk normality test
## 
## data:  LogWeightedPCI
## W = 0.45124, p-value < 2.2e-16
cor(ProjectsFiltered$`Category ID`,LogWeightedPCI, method = "pearson")
## [1] 0.04492085

It is at this point I remembered that the Category ID is not really a numerical value in-spirit, so I did a box plot to reflect that and get my understanding of the situation back on track.

1 = no project 2 = preservation project 3 = rehabilitation project 4 = preservation AND rehabilitation projects

ggplot(ProjectsFiltered, aes(x = as.factor(`Category ID`), y = WeightedPCI)) +
  geom_boxplot()

This shed some light on the situation, but also demonstrated that I may have to use the data in a different way to get the picture I am looking for. For streets that were scheduled for both a preservation and rehabilitation project, I moved them into their own category. It may be best for me to consolidate that category with the rehabilitation category, and having the data shown through the lens of the “highest” level of treatment recieved:

1 = no project 2 = preservation project 3 = rehabilitation project

Conclusion: I attempted all three correlation methods earlier in this activity and I feel that this data was not prepared in a way to be interpreted effectively by them. The various other activities I did afterward gave me a better understanding of the data and I feel that future attempts to analyze it will be more effective once the appropriate changes are made.