2018 Central Park NYC Squirrel Census

Is temperature related to the number of squirrels sighted? The hypothesis is that as the temperature increases, so will the number of squirrel sightings in NYC's Central Park.

library(dplyr)
library(tidyr)
library(readr)
library(stringr)

squirrels <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/squirrel.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(squirrels)
##   Hectare Shift     Date Anonymized.Sighter Sighter.Observed.Weather.Data
## 1     01A    AM 10072018                110                  70º F, Foggy
## 2     01A    PM 10142018                177               54º F, overcast
## 3     01B    AM 10122018                 11                  60º F, sunny
## 4     01B    PM 10192018                109            59.8º F, Sun, Cool
## 5     01C    PM 10132018                241          55° F, Partly Cloudy
## 6     01D    AM 10062018                250  66º F, cloudy, light drizzle
##     Litter Litter.Notes                   Other.Animal.Sightings
## 1     Some         <NA>                          Humans, Pigeons
## 2 Abundant         <NA>                          Humans, Pigeons
## 3     Some         <NA>            Humans, Dogs, Pigeons, Horses
## 4     Some         <NA> Humans, Dogs, Pigeons, Sparrow, Blue jay
## 5     None         <NA>             Humans, Dogs, Pigeons, Birds
## 6     Some         <NA>                                   Humans
##   Hectare.Conditions Hectare.Conditions.Notes Number.of.sighters
## 1               Busy                     <NA>                  1
## 2               Busy                     <NA>                  1
## 3               Busy                     <NA>                  1
## 4               Busy                     <NA>                  1
## 5               Busy                     <NA>                  1
## 6               Calm                     <NA>                  1
##   Number.of.Squirrels Total.Time.of.Sighting
## 1                   4                     22
## 2                   7                     26
## 3                  17                     23
## 4                  10                     35
## 5                  10                     25
## 6                   9                     35


Clean Data and Identify Variables

df <- squirrels %>%
  mutate(Temperature = substr(`Sighter.Observed.Weather.Data`, 1, 2),  # Extract the first two characters
         Temperature = if_else(str_detect(Temperature, "^\\d{2}$"), as.numeric(Temperature), NA_real_)) %>%
  select(`Number.of.Squirrels`, Temperature) %>%
  drop_na(Temperature)  # Drop rows where Temperature is NA

head(df)
##   Number.of.Squirrels Temperature
## 1                   4          70
## 2                   7          54
## 3                  17          60
## 4                  10          59
## 5                  10          55
## 6                   9          66
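Taking the first two characters truncates readings such as "59.8º F" to 59 and would miss single- or three-digit temperatures. As a minimal sketch of an alternative, assuming every usable reading starts with the number, the leading numeric value can be extracted with a regular expression. The helper name parse_temperature is hypothetical and not part of the original analysis; it uses base R only.

```r
# Hypothetical helper: pull the leading number (with optional decimals)
# out of a free-text weather string; returns NA when no number leads.
parse_temperature <- function(weather) {
  m <- regexpr("^[0-9]+(\\.[0-9]+)?", weather)
  out <- rep(NA_real_, length(weather))
  hit <- !is.na(m) & m > 0
  # regmatches() returns only the matched elements, in order
  out[hit] <- as.numeric(regmatches(weather, m))
  out
}

# Sample strings taken from the head() output above, plus two edge cases
weather <- c("70º F, Foggy", "59.8º F, Sun, Cool", "sunny", NA)
parsed <- parse_temperature(weather)
print(parsed)
```

Note that this keeps 59.8 as 59.8 rather than 59, so a model fitted on values parsed this way could differ slightly from the one reported here.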


The squirrel model

We plot Temperature on the x-axis and Number of Squirrels on the y-axis, and overlay the fitted simple linear regression line.

The scatter does not show a clear linear relationship between temperature and squirrel counts: the points are clustered, and the counts do not look normally distributed.

# Fit linear model
squirrel_model <- lm(`Number.of.Squirrels` ~ Temperature, data = df)

# Plot
plot(df$Temperature, df$`Number.of.Squirrels`, 
     main = "Number of Squirrels vs Temperature", 
     xlab = "Temperature", 
     ylab = "Number of Squirrels",
     pch = 19, col = "darkgray")

# Add the regression line 
abline(squirrel_model, col = "red")


Interpretation of the Model

Although the estimates suggest a positive relationship between temperature and the number of squirrels, the extremely low R-squared value indicates that this model would not be able to predict the number of squirrels from temperature.

Coefficients:

Temperature (0.05084): The coefficient for temperature suggests that for every one-degree Fahrenheit increase in temperature, the number of squirrels is expected to increase by approximately 0.051.
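To make the size of the slope concrete, here is a small worked sketch that plugs the reported intercept and slope into the fitted line. The function name predict_squirrels is hypothetical, the coefficients are copied by hand from the model summary, and temperatures are in degrees Fahrenheit as recorded in the data.

```r
# Coefficients copied from the summary(squirrel_model) output
intercept <- 1.34485
slope <- 0.05084

# Hypothetical helper: expected squirrel count at a given temperature (F)
predict_squirrels <- function(temp_f) intercept + slope * temp_f

at_60 <- predict_squirrels(60)            # about 4.4 squirrels
gain_10 <- predict_squirrels(70) - at_60  # 10 * slope, about 0.51 squirrels
print(c(at_60 = at_60, gain_10 = gain_10))
```

A ten-degree rise therefore shifts the expected count by only about half a squirrel.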

Standard Error:

Temperature (0.01316): The standard error is small relative to the coefficient (roughly a quarter of its size), indicating a reasonably precise estimate of how the number of squirrels observed changes with temperature.

T-value:

Temperature (3.864): The t-value is the coefficient divided by its standard error. A value this far from zero indicates that the positive slope is statistically distinguishable from zero. It does not measure how strong the relationship is.
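As a quick check of where the t-value comes from, the sketch below recomputes it from the rounded estimate and standard error in the summary. Because the inputs are rounded, the result may differ from the reported 3.864 in the third decimal. In a simple regression the squared t-value should also be close to the reported F-statistic.

```r
# Values copied from the summary(squirrel_model) output
estimate <- 0.05084
std_error <- 0.01316

t_value <- estimate / std_error  # close to the reported 3.864
f_approx <- t_value^2            # close to the reported F-statistic, 14.93
print(c(t_value = t_value, f_approx = f_approx))
```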

P-value:

Temperature (0.000124): The extremely small p-value for temperature is strong evidence against the null hypothesis of a zero slope, suggesting a statistically significant association between temperature and the number of squirrels observed.
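The p-value can be reproduced approximately from the t-value and the residual degrees of freedom as a two-sided tail probability of the t distribution. This is a sketch using the rounded figures from the summary, so small differences from the reported value are possible.

```r
# Values copied from the summary(squirrel_model) output
t_value <- 3.864
df_resid <- 607

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(p_value)  # should land near the reported 0.000124
```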

Multiple R-squared (0.02401):

This value indicates that approximately 2.4% of the variability in the number of squirrels observed can be explained by the model. This low R-squared value suggests that other factors are influencing the number of squirrels observed.
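For a regression with a single predictor, R-squared can be recovered from the F-statistic and the residual degrees of freedom as F / (F + df). The sketch below applies that identity to the figures in the summary as a consistency check. It is not how the author computed the value.

```r
# Values copied from the summary(squirrel_model) output
f_stat <- 14.93
df_resid <- 607

# Identity for one-predictor regression: R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)
print(round(r_squared, 4))  # about 0.024, matching the reported 0.02401
```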

# Summary
summary(squirrel_model)
## 
## Call:
## lm(formula = Number.of.Squirrels ~ Temperature, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4120 -2.7002 -0.7002  1.8590 18.3506 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.34485    0.80731   1.666 0.096263 .  
## Temperature  0.05084    0.01316   3.864 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.567 on 607 degrees of freedom
## Multiple R-squared:  0.02401,    Adjusted R-squared:  0.0224 
## F-statistic: 14.93 on 1 and 607 DF,  p-value: 0.0001236
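A 95% confidence interval for the slope gives a sense of the plausible range of the temperature effect. The sketch below builds it by hand from the rounded estimate and standard error. With the fitted model in the session, confint(squirrel_model) would be the more direct route.

```r
# Values copied from the summary(squirrel_model) output
estimate <- 0.05084
std_error <- 0.01316
df_resid <- 607

# Critical t-value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)  # roughly 0.025 to 0.077 squirrels per degree F
```

The interval excludes zero, which agrees with the small p-value, but even its upper end implies well under one additional squirrel per ten degrees.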


Residuals Analysis of the Linear Regression Model

Residuals vs Fitted

The slight curve in the red line and the funnel shape of the points as we move from left to right suggest non-linearity and heteroscedasticity.

Normal Q-Q Plot

The points follow the line closely in the center but deviate in the tails, with a few points in the upper right deviating strongly from what is expected under normality. This indicates that the residuals may have heavier tails than the normal distribution, and that potential outliers could be influencing the distribution of residuals.
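The visual reading of the Q-Q plot can be supplemented with a formal normality test. The sketch below runs the Shapiro-Wilk test from base R on the residuals of a stand-in model fitted to the built-in cars dataset, so that it runs without downloading the squirrel data. The stand-in model is an assumption for illustration only.

```r
# Stand-in model on the built-in cars data (not the squirrel data)
demo_model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(demo_model))
print(sw$statistic)
print(sw$p.value)
```

With the squirrel model available, shapiro.test(residuals(squirrel_model)) would apply the same check to the residuals discussed here.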

Scale-Location

The spread of the standardized residuals increases with the fitted values, a pattern that can signal heteroscedasticity.
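One way to back up the visual impression is a Breusch-Pagan style test, which regresses the squared residuals on the predictors and compares n times the auxiliary R-squared to a chi-squared distribution. The sketch below is a hand-rolled version of the studentized form in base R. The function name breusch_pagan is hypothetical, and the built-in cars dataset stands in for the squirrel data.

```r
# Hypothetical helper: studentized Breusch-Pagan test for an lm fit
breusch_pagan <- function(model) {
  sq_resid <- residuals(model)^2
  # Predictor columns only (drop the intercept column)
  predictors <- model.matrix(model)[, -1, drop = FALSE]
  aux <- lm(sq_resid ~ predictors)
  statistic <- length(sq_resid) * summary(aux)$r.squared
  df <- ncol(predictors)
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  c(statistic = statistic, df = df, p.value = p_value)
}

# Stand-in model on the built-in cars data (not the squirrel data)
demo_model <- lm(dist ~ speed, data = cars)
bp <- breusch_pagan(demo_model)
print(bp)
```

A small p-value would support the heteroscedasticity seen in the plot. With the squirrel model in the session, the call would be breusch_pagan(squirrel_model).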

Residuals vs Leverage

Most points cluster to the left, indicating low leverage. No points fall outside the Cook’s distance lines, so no individual observation has a large influence on the fit. A few observations (45, 187, 554) may be outliers, but they do not exceed the Cook’s distance threshold.
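The quantities behind this plot can also be inspected numerically. The sketch below computes Cook's distance and leverage with base R on a stand-in model fitted to the built-in cars dataset. The 4/n cutoff is a common rule of thumb chosen here as an assumption. It is not the threshold drawn by plot().

```r
# Stand-in model on the built-in cars data (not the squirrel data)
demo_model <- lm(dist ~ speed, data = cars)

cooks <- cooks.distance(demo_model)  # influence of each observation
leverage <- hatvalues(demo_model)    # leverage of each observation
n <- nobs(demo_model)

# Assumed rule of thumb: flag observations with Cook's distance above 4/n
flagged <- which(cooks > 4 / n)
top3 <- head(sort(cooks, decreasing = TRUE), 3)
print(flagged)
print(top3)
```

Running the same calls on squirrel_model would list the most influential sightings by row number, which can be compared against the labels shown in the plot.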

Conclusion for Simple Linear Regression Model

The model would not be appropriate, as it violates these model assumptions:

  • The Residuals vs Fitted plot suggests non-linearity and possible heteroscedasticity.
  • The Q-Q plot indicates potential non-normality of residuals, with the presence of outliers.
  • The Scale-Location plot suggests heteroscedasticity.

# Residuals plots
par(mfrow=c(2,2))
plot(squirrel_model)