Week 9 Data Dive

library(tidyverse)

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(effsize)
library(boot)
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)


wildlife <- read_delim("./Urban_Wildlife_Response_Edited.csv", delim = ",")

## Rows: 6385 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): DT_Initial, DT_Response, Time_To_Respond, Borough, Property, Locat...
## dbl  (3): Response_Duration, Num_of_Animals, Hours_Monitoring
## lgl  (4): PEP_Response, Animal_Monitored, Police_Response, ESU_Response
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

wildlife <- wildlife |>
  mutate(DT_Response = as.POSIXct(DT_Response, format = "%m/%d/%Y %H:%M")) |>
  mutate(Time_Response = as.numeric(format(DT_Response, "%H")) + 
                        as.numeric(format(DT_Response,"%M")) / 60) |>
  mutate(Year_Response = as.numeric(format(DT_Response, "%Y"))) |>
  mutate(Time_To_Respond = as.POSIXct(DT_Response, format = "%H:%M")) |>
  mutate(Time_To_Respond_Num = as.numeric(format(Time_To_Respond, "%H")) + 
                        as.numeric(format(Time_To_Respond,"%M")) / 60) |>
  mutate(Animal_Found = if_else(Final_Ranger_Action == "Unfounded", 0, 1)) |>
  mutate(Animal_Healthy = if_else(Animal_Condition == "Healthy", 1, 0)) |>
  mutate(Animal_DOA = if_else(Animal_Condition == "DOA", 1, 0))

Linear Regression Model

Estimating Response Duration

For week 8, I built a linear regression model estimating the response duration based on the “Time_Response” variable. The original model and it’s summary has been shown below.

model <- lm(Response_Duration ~ Time_Response, wildlife)

summary(model)

## 
## Call:
## lm(formula = Response_Duration ~ Time_Response, data = wildlife)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.570 -0.812 -0.346  0.498 73.180 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.820079   0.079906  22.778  < 2e-16 ***
## Time_Response -0.031771   0.006235  -5.096 3.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.594 on 6383 degrees of freedom
## Multiple R-squared:  0.004052,   Adjusted R-squared:  0.003896 
## F-statistic: 25.97 on 1 and 6383 DF,  p-value: 3.572e-07

Adding Animal_Found

The next variable I’d like to add to my model is the binary “Animal_Found” variable. This indicates whether or not the animal was found. The coefficient of 0.5 suggests that if an animal is found, the response lasts about 30 minutes longer than if the animal is not found. I will continue to include this variable in the model, as it has improved the \(R^2\) value significantly, taking the model from an \(R^2\) of 0.004 to an \(R^2\) of 0.022.

model_2 <- lm(Response_Duration ~ Time_Response + Animal_Found, wildlife)

summary(model_2)

## 
## Call:
## lm(formula = Response_Duration ~ Time_Response + Animal_Found, 
##     data = wildlife)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.673 -0.639 -0.179  0.425 73.077 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.395642   0.088160  15.831  < 2e-16 ***
## Time_Response -0.030960   0.006178  -5.012 5.55e-07 ***
## Animal_Found   0.527371   0.048186  10.944  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 6382 degrees of freedom
## Multiple R-squared:  0.0224, Adjusted R-squared:  0.02209 
## F-statistic: 73.12 on 2 and 6382 DF,  p-value: < 2.2e-16

Adding Year_Response

For the third variable, I wanted to try adding another continuous variable. I had considered using “Year_Response” in week 8’s data dive, but ultimately went another direction, so I was curious about if it would improve the model or not. Based on the summary below, I would likely not include “Year-Response”, as it does not appear to add to the model. The \(R^2\) for this model is essentially the same as the two-variable model above, and the p-value for “Year_Response” is 0.161. While I do not have a set significance level to compare to, 0.161 is pretty high, and is not strong enough to suggest “Year_Response” is statistically significant. For these reasons, I will be removing this from the model going forward.

model_3 <- lm(Response_Duration ~ Time_Response + Animal_Found + Year_Response, wildlife)

summary(model_3)

## 
## Call:
## lm(formula = Response_Duration ~ Time_Response + Animal_Found + 
##     Year_Response, data = wildlife)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.695 -0.645 -0.189  0.430 73.067 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -28.989957  21.666435  -1.338    0.181    
## Time_Response  -0.030882   0.006178  -4.999 5.92e-07 ***
## Animal_Found    0.525919   0.048194  10.913  < 2e-16 ***
## Year_Response   0.015033   0.010719   1.402    0.161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 6381 degrees of freedom
## Multiple R-squared:  0.0227, Adjusted R-squared:  0.02224 
## F-statistic: 49.41 on 3 and 6381 DF,  p-value: < 2.2e-16

Adding Animal_Healthy

The binary variable “Animal_Healthy” is determined by the “Animal_Condition” field. If there is a value in that field, it could be Healthy, Unhealthy, Injured, or DOA. By adding an additional column, we can see the relationship between a healthy animal and the response time. Based on the summary below, we can see that compared to the two variable model (model_2) above, this model does have a higher \(R^2\) value at 0.031 vs 0.022. The added variable of “Animal_Healthy” also has a pretty small p-value, indicating it’s a good predictor variable. The F-Stat for this model is somewhat lower than the 2 variable model, but given the improvement in \(R^2\) and small p-value, this just signifies the improvement from the 3rd variable wasn’t as large as the other two variables, even though it is still improving. Because of the improvement, I will continue to include this variable.

model_4 <- lm(Response_Duration ~ Time_Response + Animal_Found + Animal_Healthy, wildlife)

summary(model_4)

## 
## Call:
## lm(formula = Response_Duration ~ Time_Response + Animal_Found + 
##     Animal_Healthy, data = wildlife)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.646 -0.631 -0.161  0.418 33.995 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.255215   0.072522  17.308  < 2e-16 ***
## Time_Response  -0.017487   0.005057  -3.458 0.000547 ***
## Animal_Found    0.549795   0.040197  13.677  < 2e-16 ***
## Animal_Healthy -0.173810   0.035252  -4.930 8.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.287 on 6345 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.03139,    Adjusted R-squared:  0.03093 
## F-statistic: 68.54 on 3 and 6345 DF,  p-value: < 2.2e-16

Adding Animal_DOA

The last variable I wanted to include is the “Animal_DOA” variable. It, like “Animal_Healthy” is dependent on the “Animal_Status” variable. If the value of that field is “DOA”, this new field is 1, otherwise it’s null. This helps to estimate what affect an animal being dead on arrival has on the response time. Similarly to “Animal_Healthy”, this variable improves the \(R^2\) and has a low p-value, indicating an improvement in the model and statistical significance, respectively. The addition of “Animal_DOA”, however, also increases the F-Stat, which is at 95.27 versus the 2 variable model of 73.12. Because of the significance and the improvement in the model, I would also keep this variable in the model.

model_5 <- lm(Response_Duration ~ Time_Response + Animal_Found + Animal_Healthy + Animal_DOA, wildlife)

summary(model_5)

## 
## Call:
## lm(formula = Response_Duration ~ Time_Response + Animal_Found + 
##     Animal_Healthy + Animal_DOA, data = wildlife)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.778 -0.682 -0.225  0.328 33.965 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.299312   0.071655  18.133  < 2e-16 ***
## Time_Response  -0.018426   0.004991  -3.692 0.000224 ***
## Animal_Found    0.658794   0.040544  16.249  < 2e-16 ***
## Animal_Healthy -0.304604   0.036209  -8.412  < 2e-16 ***
## Animal_DOA     -0.704236   0.054016 -13.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.27 on 6344 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.05666,    Adjusted R-squared:  0.05607 
## F-statistic: 95.27 on 4 and 6344 DF,  p-value: < 2.2e-16

Evaluating the Model

Residuals vs Fitted Values

The trend line for the residuals remains relatively close to zero across the model, with some areas showing a slight overestimation. We can also see a very slight increase in variation as the fitted value increases. This increase is not large, however, so we can still assume that this model abides by the assumption that errors have constant variance across all predictions. Overall, while there seems to be more regular over-estimation, the residuals average out to around 0.

gg_resfitted(model_5) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals vs X Values

Similarly to looking at the residuals vs the fitted model, there is some more variation on the overestimation side, however, the model stays pretty close to zero regardless of the “Time_Response” variable.

plots <- gg_resX(model_5, plot.all = FALSE)

# for each variable of interest ...
plots$Time_Response +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residual Histogram

The residual Histogram is roughly normal, though the right tail is longer. This matches up with the previous graph showing more variability in the residuals on the overestimated end. It also indicates there is likely still something missing from the model. Despite the slight skew, this still generally supports the assumption that errors are normally distributed.

gg_reshist(model_5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

QQ Plot

As indicated with the histogram, the right tail is longer than the left, which we can see a more exaggerated view of with the QQ Plot. With the exception of the right tail, the QQ Plot follows the line well, indicating a roughly normal distribution.

gg_qqplot(model_5)

Cook’s D by Observation

First, we need to find rows of data that may have a large influence on the model. Due in part to the large number of rows, there are many rows that are being flagged. I have have selected and plotted the top 6 below.

gg_cooksd(model_5, threshold = 'matlab')

ggplot(data = slice(wildlife, 
                    c(221,330,297,1808,4175,6305))) +
  geom_point(data = wildlife,
             mapping = aes(x = Time_Response, y = Response_Duration)) +
  geom_point(mapping = aes(x = Time_Response, y = Response_Duration),
             color = 'hotpink') +
  geom_text_repel(mapping = aes(x = Time_Response, 
                                y = Response_Duration,
                                label = Species_Status),
             color = 'hotpink') +
  labs(title="High Influence Points",
       subtitle="Label = Species Status")

Interestingly to me, the data points flagged by the Cook’s D plot don’t actually appear to be outliers in the plot comparing Time_Response to Response_Duration. I tried several labels, to see if I could find any similarities, and the Species_Status was the most interesting, with five of the six being “Native”. That could just be due to the proportion of rows containing “Native” animal responses.

If I had any other continuous variables in the model, I would attempt to visualize those, but my remaining three variables are all binary and do not offer any greater understanding. Overall, I’m not sure what to make of this particular evaluation.

Conclusion

The model with 4 variables (Time_Response, Animal_Found, Animal_Healthy, and Animal_DOA) is a reasonably accurate and stable model, though there is certainly room for improvement. There are no glaring issues based on the regression statistics or the graphic evaluations.