library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(effsize)
library(boot)
wildlife <- read_delim("./Urban_Wildlife_Response_Edited.csv", delim = ",")
## Rows: 6385 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): DT_Initial, DT_Response, Time_To_Respond, Borough, Property, Locat...
## dbl (3): Response_Duration, Num_of_Animals, Hours_Monitoring
## lgl (4): PEP_Response, Animal_Monitored, Police_Response, ESU_Response
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The following null hypothesis asserts that there is no difference in response time, regardless of the condition the animal(s) are in once found. If this null hypothesis is disproved, that would indicate that there is some difference in the average length of response.
\[ H_0 : \text{average response time is identical regardless of animal condition.} \]
To prevent errors, I have only included records that contain valid responses for the animal condition. The excluded columns are all either “N/A” or blank, most due to the animal not being found.
wildlife_filt <- wildlife |>
filter(Animal_Condition %in% c("Healthy", "Unhealthy", "Injured", "DOA"))
When running the ANOVA test (seen below), we get a P-value of <2e-16, which is incredibly small. This suggests that there is some difference in the average response length correlated to animal condition.
anova_test <- aov(Response_Duration ~ Animal_Condition, data = wildlife_filt)
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## Animal_Condition 3 301 100.25 67.98 <2e-16 ***
## Residuals 5681 8377 1.47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given the values of the pairwise test, it is unlikely that any of these have similar averages.
pairwise.t.test(wildlife_filt$Response_Duration, wildlife_filt$Animal_Condition, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: wildlife_filt$Response_Duration and wildlife_filt$Animal_Condition
##
## DOA Healthy Injured
## Healthy 1.6e-10 - -
## Injured < 2e-16 0.00032 -
## Unhealthy < 2e-16 < 2e-16 6.7e-07
##
## P value adjustment method: bonferroni
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
wf_ci <- wildlife_filt |>
group_by(Animal_Condition) |>
summarise(ci_lower = boot_ci(Response_Duration, mean)['lower'],
mean_response_time = mean(Response_Duration),
ci_upper = boot_ci(Response_Duration, mean)['upper'])
wf_ci
## # A tibble: 4 × 4
## Animal_Condition ci_lower mean_response_time ci_upper
## <chr> <dbl> <dbl> <dbl>
## 1 DOA 0.914 1.00 1.11
## 2 Healthy 1.30 1.36 1.41
## 3 Injured 1.47 1.53 1.59
## 4 Unhealthy 1.71 1.76 1.85
wf_ci |>
ggplot() +
geom_errorbarh(mapping = aes(y = Animal_Condition,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_response_time, y = Animal_Condition,
color = 'Group Mean'),
shape = '*',
size = 8) +
scale_x_continuous(breaks = seq(0.8, 2.2, by = 0.2)) +
scale_color_manual(values=c('black', 'forestgreen')) +
labs(title = "Average Response Duration by Animal Condition",
x = "Length of Response (in hours)",
y = "Animal Condition")
To further explore the differences in average response time across the different animal conditions, I’ve built a graph showing the 95% confidence interval. As seen above, there is no overlap in the confidence intervals for each of the animal condition statuses. Given the lack of overlap and the other statistics that have been run, the null hypothesis can be rejected - there is substantial evidence against the assertion that there is no difference in average response time based on animal condition. We can then assume that animal condition does have an impact on the average response time.
wildlife <- wildlife |>
mutate(DT_Initial = as.POSIXct(DT_Initial, format = "%m/%d/%Y %H:%M")) |>
mutate(Time_Initial = as.numeric(format(DT_Initial, "%H")) +
as.numeric(format(DT_Initial,"%M")) / 60) |>
mutate(Month_Initial = as.numeric(format(DT_Initial, "%m"))) |>
mutate(Day_Initial = as.numeric(format(DT_Initial, "%d"))) |>
mutate(MY_Initial = as.numeric(format(DT_Initial, "%Y%m"))) |>
mutate(Year_Initial = as.numeric(format(DT_Initial, "%Y"))) |>
mutate(DT_Response = as.POSIXct(DT_Response, format = "%m/%d/%Y %H:%M")) |>
mutate(Time_Response = as.numeric(format(DT_Response, "%H")) +
as.numeric(format(DT_Response,"%M")) / 60) |>
mutate(Month_Response = as.numeric(format(DT_Response, "%m"))) |>
mutate(Day_Response = as.numeric(format(DT_Response, "%d"))) |>
mutate(MY_Response = as.numeric(format(DT_Response, "%Y%m"))) |>
mutate(Year_Response = as.numeric(format(DT_Response, "%Y"))) |>
mutate(Time_To_Respond = as.POSIXct(DT_Response, format = "%H:%M")) |>
mutate(Time_To_Respond_Num = as.numeric(format(Time_To_Respond, "%H")) +
as.numeric(format(Time_To_Respond,"%M")) / 60)
class(wildlife$Month_Response)
## [1] "numeric"
Using the response variable of Response Duration, there were not many options for continuous variables that seemed to have anything like a linear relationship with response time. Number of animals, the “Initial Date” derived fields, the remainder of the “Response Date” fields, and “Time to Respond” fields were all considered, and none provided a particularly good example of a linear relationship, as the average and variance of response times seems to stay pretty consistent across the available continuous variables. The below graph was used to explore and see which may have been the best fit. Ultimately, I settled on “Time_Response” which is the time of day in which a response call was started (different than when the request for a call came in - those are the “Initial” fields).
ggplot(data = wildlife, aes(x = Time_Response, y = log(Response_Duration))) +
geom_point()
model <- lm(Response_Duration ~ Time_Response, wildlife)
model$coefficients
## (Intercept) Time_Response
## 1.82007870 -0.03177076
Based on the model built, the average change per hour is -0.032 hours, or rather -1.9 minutes. This means for every hour further into the day, there is an expected decrease of two minutes per response. I was particularly interested in this and the other time variables, as I was curious to see if response times got shorter later in the day, or when it was colder out, indicating that ranger comfort may have been a factor in how long calls lasted. While there does seem to be some relationship here, less than 2 minutes difference when the average response is somewhere around the hour mark is simply not meaningful.
summary(model)
##
## Call:
## lm(formula = Response_Duration ~ Time_Response, data = wildlife)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.570 -0.812 -0.346 0.498 73.180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.820079 0.079906 22.778 < 2e-16 ***
## Time_Response -0.031771 0.006235 -5.096 3.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.594 on 6383 degrees of freedom
## Multiple R-squared: 0.004052, Adjusted R-squared: 0.003896
## F-statistic: 25.97 on 1 and 6383 DF, p-value: 3.572e-07
Based on the summary of the model above, we can glean a few key pieces of information about the dataset, along with the relationship between Response Duration and the Start Time of the Response Call.
The first is denoted by the Multiple R Squared, which is pretty low. Normally, this is a value that would be used to compare to a different model to see which is greater; however, since I only have the one model, it can still tell us that only a small part of the variance is being explained by this model.
The second piece of information comes from the F-Statistic and the p-value. The F-Stat is 25.97, which is not small, which suggests that Time_Response is adding accuracy to the model prediction. The P-Value is very small, which indicates that the relationship being observed between these two variables is not just from chance and variation in the data.
These two pieces of information combined tell us that while there is a strong relationship between “Response_Duration” and “Time_Response” that is adding accuracy to the model, only a fraction of the variation in the model is due to “Time_Response”. The model would likely be greatly improved by the addition of other variables.