library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
wildlife <- read_delim("./Urban_Wildlife_Response.csv", delim = ",")
## Rows: 6385 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): DT_Initial, DT_Response, Response_Time, Borough, Property, Locatio...
## dbl (3): Response_Duration, Num_of_Animals, Hours_Monitoring
## lgl (4): PEP_Response, Animal_Monitored, Police_Response, ESU_Response
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before reading the documentation, I had questions about the following columns:
Difference between Property and Location
311SR_Number
PEP_Response
ESU_Response
ACC_Intake_Number
Part of the reason for my confusion is that I’m not familiar with NYC services: 311, PEP, and ESU are all services provided by the city, and ACC is an animal organization. Once I read the documentation and looked up the acronyms, the column names made sense. Overall, I think the folks who set up this dataset did a good job of naming the columns and their values in a straightforward manner.
As for the difference between Property and Location, the values in those columns make it pretty clear: Property is a place, and Location is a specific spot within that place. I will say, however, that the documentation does not match what the Location column actually contains. According to the documentation, the value should be the “Cross street of location of rescue”, but in practice the column holds values like “the carousel” or “Giovanelli Playground”. Reading the documentation, you would expect a different sort of information than what was actually gathered.
Even after reading the documentation, it is not clear to me why the “Animal_Class” and “Age” columns contain so many variations of the same entry. I have already corrected “Age” so that “Adult, Infant” is the same as “Infant, Adult”, since those responses are functionally identical. “Animal_Class” has similar issues. Some “Animal_Class” entries also carry an additional code, “RVS”, which took some digging to decipher (it stands for “Rabies Vector Species”, a species that can carry and spread rabies to other animals and humans, if you’re curious).
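As a rough illustration of the kind of fix described above for “Age”, here is a minimal sketch that makes the order of comma-separated parts irrelevant. The helper name `normalize_order` and the sample values are my own assumptions for the example, not the code I actually ran on the dataset.

```r
library(stringr)

# Hypothetical helper: split each entry on commas, trim whitespace,
# sort the parts alphabetically, and join them back together.
normalize_order <- function(x) {
  parts <- str_split(x, ",")
  vapply(parts, function(p) {
    if (all(is.na(p))) return(NA_character_)  # keep missing values missing
    paste(sort(str_trim(p)), collapse = ", ")
  }, character(1))
}

# Made-up sample values for illustration only
age_raw <- c("Adult, Infant", "Infant, Adult", "Juvenile", NA)
age_clean <- normalize_order(age_raw)
age_clean
```

With this approach, both orderings of the same pair collapse to a single value, so they are counted together in any summary or plot.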
As described above, the “Animal_Class” field is not standardized when new data is entered. This leads to duplicate categories that muddy the data. For example, it’s difficult to see the true difference between “small mammals - RVS” and “small mammals - non RVS” because each appears as 2 separate bars in the graph below. To compare the two properly, you have to either modify the source data or do a lot of cleanup in the code. Depending on how the records are entered, the input could be changed to a checklist, where the animal types present are selected from a list and the system writes the value into the field. Because the system would generate the values, they would be consistent across records.
wildlife |>
  filter(str_detect(Animal_Class, "Small Mammals")) |>
  ggplot() +
  geom_bar(mapping = aes(y = Animal_Class), fill = '#DA70D6') +
  theme_minimal()
Having the RVS/non-RVS tag inside the “Animal_Class” field also muddies the data quite a bit. To simplify things, it would make more sense to have a separate True/False column indicating whether the animals belong to a rabies vector species. The only downside to a separate column is that a single response sometimes involves multiple types of animals, so RVS and non-RVS animals could appear on the same response. This is likely not a huge concern, given that mammals are the only animals that can get rabies and are therefore the only category that would receive the RVS tag. I am somewhat confused, however, because I would imagine most if not all of the “Domestic” animals would be susceptible to rabies. Exactly which mammals get the RVS tag and which do not is not something I personally understand, though I’m sure the folks working with urban wildlife are very familiar with the difference.
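To make the separate True/False column idea concrete, here is a minimal sketch of deriving such a flag from the existing text values. The column name `is_rvs` and the sample values are assumptions for illustration. The one trap is that “non RVS” also contains “RVS”, so the negated form has to be removed before checking.

```r
library(stringr)

# Made-up sample of Animal_Class values, including a spacing variant
animal_class <- c("Small Mammals-RVS", "Small Mammals-non RVS",
                  "Small Mammals - RVS", "Birds", NA)

# Strip every "non RVS" (or "non-RVS") first, then look for any remaining "RVS".
without_non_rvs <- str_remove_all(animal_class,
                                  regex("non[ -]?RVS", ignore_case = TRUE))
is_rvs <- str_detect(without_non_rvs, regex("RVS", ignore_case = TRUE))
is_rvs
```

On a response that lists both an RVS and a non-RVS class, this flag would come out TRUE for the whole record, which is exactly the ambiguity described above.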
There is a way around the issue of a single response containing multiple animal types, only some of which carry the RVS tag. If each record represented a specific animal type (unique in age, class, species, and all dependent fields) rather than a whole response, the RVS tag could live in a separate field without being applied inaccurately. This would increase the number of records and probably also the work on the data entry side, which is likely why the data is set up the way it is, but it would allow for clearer analysis.
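A minimal sketch of that restructuring, assuming the combined classes are separated by a comma and a space. The `response_id` column and the sample rows are hypothetical; the real dataset may need a different key and delimiter handling.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up responses, two of which list more than one animal class
responses <- tibble(
  response_id = 1:3,
  Animal_Class = c("Raptors, Small Mammals-RVS",
                   "Birds",
                   "Domestic, Small Mammals-non RVS")
)

# One row per animal class within each response, with its own RVS flag
per_animal <- responses |>
  separate_longer_delim(Animal_Class, delim = ", ") |>
  mutate(Is_RVS = grepl("RVS", Animal_Class) & !grepl("non RVS", Animal_Class))

per_animal
```

Each response now contributes one row per class, so the RVS flag describes a single animal type rather than the whole response.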
wildlife |>
  filter(str_detect(Animal_Class, "non RVS")) |>
  ggplot() +
  geom_bar(mapping = aes(y = Animal_Class), fill = '#556B2F') +
  theme_minimal()
In the Species_Status column, there are 4 options: Native, Invasive, Exotic, or Domestic. For some records, the species status was recorded as N/A. This appears to be primarily because the animal was not found when the urban rangers arrived: it’s difficult to determine whether a bird was native or exotic without seeing it. It does raise the question of whether there are instances when a call comes in and the species status is recorded even though the animal is not found.
wildlife |>
  ggplot() +
  geom_bar(mapping = aes(y = Species_Status), fill = '#560319') +
  theme_minimal()
The Animal_Class column seems to have a rather robust set of values covering individual animal classes, with the only major class missing being invertebrates. A quick look through the data does show at least one instance of insects (wasps), so it’s not that these creatures can’t be called about. It would be interesting to know if there are just not calls about invertebrates (with the exception of the one wasp call), or if invertebrates are not considered part of the Urban Rangers’ scope of duty.
unique_animal_class <- unique(wildlife$Animal_Class_Neat)
unique_animal_class <- unique_animal_class[!str_detect(unique_animal_class, ",")]
unique_animal_class
## [1] "Birds" "Deer"
## [3] "Small Mammals-RVS" "Small Mammals-non RVS"
## [5] "Domestic" "Terrestrial Reptile or Amphibian"
## [7] "Raptors" "Fish-numerous quantity"
## [9] "Marine Reptiles" "Coyotes"
## [11] "Marine Mammals-whales/Dolphin" "Marine Mammals-seals only"
## [13] "Rare/Endangered/Dangerous" "Non Native Fish-(invasive)"
While the lack of invertebrates is the most obvious missing value, “Animal_Class” is also one of the fields where combinations of values are sometimes entered. Given that there are 14 unique classes defined by the Urban Rangers, and each class can be either present or absent on a response, there are over 16,000 possible combinations (2^14 = 16,384 subsets). Only 5 multi-class combinations appear in the current dataset, so there are many implicitly missing values.
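The count of possible combinations can be checked with a little arithmetic. This is just a sketch of the reasoning: each of the 14 classes is either present or absent on a response. The variable names are my own.

```r
n_classes <- 14

# Every non-empty subset of the 14 classes (single classes included)
all_subsets <- 2^n_classes - 1

# Only the true combinations: subsets containing 2 or more classes
multi_class <- sum(choose(n_classes, 2:n_classes))

# Combinations observed in the dataset, per the output below
observed_combos <- 5

all_subsets
multi_class
multi_class - observed_combos
```

Whichever way it is counted, the total is over 16,000, and nearly all of those combinations never appear in the data.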
unique_animal_class_combo <- unique(wildlife$Animal_Class_Neat)
unique_animal_class_combo <- unique_animal_class_combo[str_detect(unique_animal_class_combo, ",")]
unique_animal_class_combo
## [1] "Birds, Domestic"
## [2] "Domestic, Small Mammals-non RVS"
## [3] "Terrestrial Reptile or Amphibian, Fish-numerous quantity"
## [4] "Domestic, Small Mammals - RVS"
## [5] "Raptors, Small Mammals-RVS"
For ease, I have gone back to the original dataset and created a new column, “Animal_Class_Neat”, which cleans up the values in the “Animal_Class” column based on the details in the previous paragraphs.
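For anyone who would rather do this cleanup in code than in the source file, here is a minimal sketch of one way a tidied column could be built. It only covers stray whitespace and inconsistent spacing around hyphens, and the sample values are made up; it is not the exact set of corrections I applied.

```r
library(stringr)

# Made-up raw values showing the kinds of spacing variants in the field
animal_class <- c("Small Mammals - RVS", "Small Mammals-RVS",
                  "Small Mammals -non RVS", " Birds ")

animal_class_neat <- animal_class |>
  str_squish() |>                     # trim ends, collapse repeated spaces
  str_replace_all("\\s*-\\s*", "-")   # remove spaces around hyphens

animal_class_neat
```

Applying the same rule to every entry means spacing variants of a class collapse into a single category.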
Below is a box plot of the response duration. The middle half of response durations fall between half an hour and 2 hours (the 25th and 75th percentiles). Based on the IQR of this column, I would consider anything past a few hours to be an outlier. Four hours is half a standard work day and would be a significant amount of time from a resource management standpoint, so I would probably set the “limit” there, though someone working with the Urban Rangers would be able to provide a more accurate cutoff and reasoning.
# First and third quartiles of the response duration
response_q1 <- quantile(wildlife$Response_Duration, 0.25)
response_q3 <- quantile(wildlife$Response_Duration, 0.75)
response_q1
## 25%
## 0.5
response_q3
## 75%
## 2
wildlife |>
  ggplot() +
  geom_boxplot(mapping = aes(y = Response_Duration)) +
  labs(title = "Response Time") +
  theme_minimal()
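For comparison with the four-hour limit suggested above, the common 1.5 × IQR rule of thumb can be applied to the quartiles reported in the output (0.5 and 2). This is a sketch only: the `durations` vector is made up for illustration, and the 1.5 multiplier is a convention rather than anything specified by the Urban Rangers.

```r
# Quartiles as reported in the output above
q1 <- 0.5
q3 <- 2

iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr   # conventional upper outlier cutoff
upper_fence

# Hypothetical durations, to show how the cutoff would be applied
durations <- c(0.25, 0.5, 1, 2, 3, 4, 6, 12)
outliers <- durations[durations > upper_fence]
outliers
```

The resulting cutoff lands a little above four hours, so the resource-based limit and the statistical rule of thumb roughly agree.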