Is temperature related to the number of squirrels sighted? The hypothesis is that as the temperature increases, so will the number of squirrel sightings in NYC Central Park.
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
squirrels <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/squirrel.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(squirrels)
## Hectare Shift Date Anonymized.Sighter Sighter.Observed.Weather.Data
## 1 01A AM 10072018 110 70º F, Foggy
## 2 01A PM 10142018 177 54º F, overcast
## 3 01B AM 10122018 11 60º F, sunny
## 4 01B PM 10192018 109 59.8º F, Sun, Cool
## 5 01C PM 10132018 241 55° F, Partly Cloudy
## 6 01D AM 10062018 250 66º F, cloudy, light drizzle
## Litter Litter.Notes Other.Animal.Sightings
## 1 Some <NA> Humans, Pigeons
## 2 Abundant <NA> Humans, Pigeons
## 3 Some <NA> Humans, Dogs, Pigeons, Horses
## 4 Some <NA> Humans, Dogs, Pigeons, Sparrow, Blue jay
## 5 None <NA> Humans, Dogs, Pigeons, Birds
## 6 Some <NA> Humans
## Hectare.Conditions Hectare.Conditions.Notes Number.of.sighters
## 1 Busy <NA> 1
## 2 Busy <NA> 1
## 3 Busy <NA> 1
## 4 Busy <NA> 1
## 5 Busy <NA> 1
## 6 Calm <NA> 1
## Number.of.Squirrels Total.Time.of.Sighting
## 1 4 22
## 2 7 26
## 3 17 23
## 4 10 35
## 5 10 25
## 6 9 35
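df <- squirrels %>%
  # Extract the first two characters of the weather note
  mutate(Temperature = substr(`Sighter.Observed.Weather.Data`, 1, 2),
         # Keep the value only when both characters are digits; the coercion
         # warning for non-numeric notes is suppressed because those become NA anyway
         Temperature = if_else(str_detect(Temperature, "^\\d{2}$"),
                               suppressWarnings(as.numeric(Temperature)), NA_real_)) %>%
  select(`Number.of.Squirrels`, Temperature) %>%
  drop_na(Temperature) # Drop rows where Temperature is NA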
head(df)
## Number.of.Squirrels Temperature
## 1 4 70
## 2 7 54
## 3 17 60
## 4 10 59
## 5 10 55
## 6 9 66
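Taking the first two characters truncates a reading such as 59.8º F to 59, as row 4 shows. A minimal sketch of one alternative, assuming every usable note starts with the number: extract the leading numeric token with a regular expression. The helper name `parse_temperature` and the sample strings are illustrative and not part of the original analysis.

```r
# Hypothetical helper: pull the leading number (with optional decimals)
# out of a free-text weather note; returns NA when no leading number exists.
parse_temperature <- function(x) {
  x <- as.character(x)
  hit <- regexpr("^[0-9]+(\\.[0-9]+)?", x)
  ok <- !is.na(hit) & hit > 0          # TRUE only where a number was found
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(regmatches(x, hit))
  out
}

# Sample strings modelled on the head() output above ("\u00ba" is the "º" sign)
weather <- c("70\u00ba F, Foggy", "59.8\u00ba F, Sun, Cool", "Partly Cloudy", NA)
parse_temperature(weather)
```

Using such a helper inside `mutate()` in place of `substr()` would keep the decimal readings, so the fitted coefficients could differ slightly from the ones reported here.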
We plot Temperature on the X axis and the Number of Squirrels on the Y axis, and draw the fitted simple linear regression line over the plot.
The plot shows no obvious linear relationship between temperature and the number of squirrels sighted. The sightings appear to be clustered and not normally distributed.
# Fit linear model
squirrel_model <- lm(`Number.of.Squirrels` ~ Temperature, data = df)
# Plot
plot(df$Temperature, df$`Number.of.Squirrels`,
main = "Actual vs Fitted Values",
xlab = "Temperature",
ylab = "Number of Squirrels",
pch = 19, col = "darkgray")
# Add the regression line
abline(squirrel_model, col = "red")
Although the numbers suggest a positive relationship between temperature and the number of squirrels, the extremely low R-squared value indicates that this model would not be able to predict the number of squirrels from temperature.
Coefficients:
Temperature (0.05084): The coefficient for temperature suggests that for every one-degree Fahrenheit increase in temperature (the unit recorded in the weather data), the number of squirrels is expected to increase by approximately 0.051.
Standard Error:
Temperature (0.01316): The standard error is small relative to the temperature coefficient (about a quarter of its size), indicating a reasonably precise estimate of how temperature affects the number of squirrels observed.
T-value:
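Temperature (3.864): The t-value is the coefficient divided by its standard error. A value of 3.864 is well above the usual threshold of about 2, indicating a statistically significant positive relationship between temperature and the number of squirrels observed. Significance here describes how reliably the slope differs from zero, not how strong the relationship is.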
P-value:
Temperature (0.000124): The extremely small p-value for temperature indicates strong evidence against the null hypothesis, suggesting that temperature has a statistically significant effect on the number of squirrels observed.
Multiple R-squared (0.02401):
This value indicates that approximately 2.4% of the variability in the number of squirrels observed can be explained by the model. This low R-squared value suggests that other factors are influencing the number of squirrels observed.
# Summary
summary(squirrel_model)
##
## Call:
## lm(formula = Number.of.Squirrels ~ Temperature, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4120 -2.7002 -0.7002 1.8590 18.3506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.34485 0.80731 1.666 0.096263 .
## Temperature 0.05084 0.01316 3.864 0.000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.567 on 607 degrees of freedom
## Multiple R-squared: 0.02401, Adjusted R-squared: 0.0224
## F-statistic: 14.93 on 1 and 607 DF, p-value: 0.0001236
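The statistics in this summary are tied together by simple arithmetic, which helps explain how a highly significant slope can coexist with a very low R-squared. The sketch below recomputes them from the rounded values printed above, so small rounding differences are expected. The variable names are illustrative.

```r
# Values copied from the summary() output above
estimate  <- 0.05084   # Temperature coefficient
std_error <- 0.01316   # its standard error
df_resid  <- 607       # residual degrees of freedom

t_value <- estimate / std_error                  # t statistic for the slope
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value
f_stat  <- t_value^2                             # with one predictor, F equals t^2
r_sq    <- f_stat / (f_stat + df_resid)          # R-squared recovered from F

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * std_error

# Expected change in squirrel count for a 10-degree Fahrenheit rise
ten_degree_change <- 10 * estimate

round(c(t = t_value, p = p_value, F = f_stat, R2 = r_sq), 5)
round(ci, 4)
ten_degree_change
```

With 607 residual degrees of freedom, even a slope that explains about 2.4% of the variance produces a large t-value, which is why the p-value and the R-squared seem to point in different directions.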
Residuals vs Fitted
The slight curve in the red line and the funnel shape of the points as we move from left to right suggest non-linearity and heteroscedasticity.
Normal Q-Q Plot
The points follow the line closely in the center but deviate in the tails, with a few points in the upper right deviating strongly from what is expected under normality. This indicates the residuals may have heavier tails than the normal distribution, and potential outliers could be influencing the distribution of residuals.
Scale-Location
The spread of the standardized residuals increases across the plot, a pattern that can signify heteroscedasticity.
Residuals vs Leverage
Most points cluster to the left, indicating low leverage. No points exceed the Cook's distance lines, so no individual observation has a large influence on the fit. A few observations (45, 187, 554) may be outliers but do not exceed the Cook's distance threshold.
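The reading of the Residuals vs Leverage plot can also be checked numerically. The sketch below uses simulated stand-in data, not the squirrel data, and the cutoffs are common rules of thumb rather than hard limits. The 0.5 cutoff matches the first Cook's distance contour that `plot.lm` draws by default. To apply it to this analysis, `fit` would be replaced by `squirrel_model`.

```r
set.seed(1)
# Simulated stand-in data (NOT the squirrel data): counts with a weak trend
sim <- data.frame(temp = runif(200, 40, 80))
sim$count <- rpois(200, lambda = exp(0.5 + 0.015 * sim$temp))
fit <- lm(count ~ temp, data = sim)

n <- nrow(sim)
p <- length(coef(fit))          # number of estimated coefficients

cooks <- cooks.distance(fit)    # influence of each observation on the fit
lev   <- hatvalues(fit)         # leverage of each observation
rstd  <- rstandard(fit)         # standardized residuals

# Rules of thumb (conventions, not hard limits)
influential   <- which(cooks > 0.5)      # beyond the first Cook's contour
high_leverage <- which(lev > 2 * p / n)  # more than twice the average leverage
outliers      <- which(abs(rstd) > 3)    # unusually large residuals

# Observations with the largest Cook's distances
head(sort(cooks, decreasing = TRUE), 3)
```

An observation can appear in `outliers` without appearing in `influential`, which is the situation described above for observations 45, 187, and 554.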
Conclusion for Simple Linear Regression Model
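The model would not be appropriate because the diagnostic plots indicate that it violates these model assumptions: linearity, constant variance of the residuals (homoscedasticity), and normality of the residuals.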
# Residuals plots
par(mfrow=c(2,2))
plot(squirrel_model)
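The visual impressions from these plots can be supplemented with formal tests. The sketch below is one way to do that in base R, shown on simulated stand-in data (not the squirrel data) whose noise grows with the predictor. The heteroscedasticity check is a hand-computed studentized Breusch-Pagan (Koenker) statistic, a technique swapped in for illustration and not used in the original analysis. To apply it here, `fit` and `x` would be replaced by `squirrel_model` and `df$Temperature`.

```r
set.seed(42)
# Simulated stand-in data (NOT the squirrel data): the noise grows with x
x <- runif(300, 40, 80)
y <- 1 + 0.05 * x + rnorm(300, sd = 0.05 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan (Koenker) test computed by hand:
# regress the squared residuals on the predictor; under constant variance,
# n * R^2 is approximately chi-squared with 1 degree of freedom.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Shapiro-Wilk test of normality on the residuals (accepts 3 to 5000 values)
sw <- shapiro.test(residuals(fit))

c(BP_statistic = bp_stat, BP_p_value = bp_p, SW_p_value = sw$p.value)
```

Small p-values would support the conclusions drawn from the plots: a small Breusch-Pagan p-value points to non-constant variance, and a small Shapiro-Wilk p-value points to non-normal residuals.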