Squirrel Census

Author

Julian Beckert

2018 Squirrel Census

squirrel on a rock with a piece of bread in its mouth

Provided to Unsplash by @judevin on Instagram

A group called “The Squirrel Census” conducted a census of all of the squirrels in NYC’s Central Park in 2018. They conducted this survey with the help of several hundred volunteers over the course of two weeks in October (Allen, 2022). Along with the count, the volunteers provided a ton of other information, like the temperature in the hectares they were assigned, the numberofsquirrels in each, the primaryfurcolor of each squirrel, and random other notes including the different things the squirrels they met were doing. I’ve always loved squirrels, so when I saw this dataset I immediately decided to use it. I want to look at the potential correlations between variables in this dataset, like the effect of weather on squirrel numbers (with the hypothesis that lower temperature causes less squirrel sightings), and I also just want to plot the squirrels on a map. I cleaned up my data almost entirely in R, with the exception of going into the CSV file to edit a few weather observations that used Celsius instead of Fahrenheit. I used select, mutate, filter, gsub, and other commands to filter and subset my data, remove unnecessary columns, and remove and change unwanted characters or words. I also used a function to convert a series of boolean columns into one column that listed the names of the “TRUE” columns in each row.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
library(leaflet)
library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(ggfortify)
squirrel <- read_csv("centralparksquirrelcensus.csv")
Rows: 3023 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Unique Squirrel ID, Hectare, Shift, Age, Primary Fur Color, Highli...
dbl  (4): Y, X, Date, Hectare Squirrel Number
lgl (13): Running, Chasing, Climbing, Eating, Foraging, Kuks, Quaas, Moans, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hectare <- read_csv("centralpartsquirrel_hectare.csv")
Rows: 700 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Hectare, Shift, Sighter Observed Weather Data, Litter, Litter Notes...
dbl (5): Date, Anonymized Sighter, Number of sighters, Number of Squirrels, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Basic cleaning

names(squirrel) <- tolower(names(squirrel))
names(squirrel) <- gsub(" ","",names(squirrel))
names(hectare) <- tolower(names(hectare))
names(hectare) <- gsub(" ","",names(hectare))

# remove irrelevant coumns
hectare <- hectare |>
    select(-litternotes,-hectareconditionsnotes, -otheranimalsightings, -date, -anonymizedsighter)
squirrel <- squirrel |>
  select(-colornotes, -specificlocation, -otherinteractions, -date)


squirrel <- squirrel |>
  rename_at("x", ~"long") |>
  rename_at("y", ~ "lat")
# I needed to rename these columns to make them work in Leaflet. source for command: Statology

# turning the "sighterobservedweatherdata" column into just "temperature"
hectare <- hectare |>
  mutate(temperature = gsub("\\D", "", sighterobservedweatherdata))
hectare$temperature <- as.numeric(as.character(hectare$temperature)) 
# source for command: u/ksalot on Reddit, next line wouldn't run without it

hectare <- hectare |>
  filter(temperature <99)
# the previous command turned entries like "~67-70 degrees" into "6770", so I had to remove rows with more than 2 digits

Make function: I want to plot the squirrels on a map and list what they’re doing, but the behavior/activity of the squirrels is only described in a series of true/false columns. I tried a lot of different ways to print the names of the “true” columns for each row into their own rows but couldn’t figure it out. One of my friends explained that what I need to do is write a function that will do this. The function she wrote ended up not working, so I had to plug it into ChatGPT to rewrite it to something that would work and explain why.

list_activities <- function(run, chase, climb, eat, forage, kuk, quaa, tailflag) {
  activities <- c() # c() is needed to fill values into rows as a vector (list)
  if (run) activities <- c(activities, "Running")
  if (chase) activities <- c(activities, "Chasing")
  if (climb) activities <- c(activities, "Climbing")
  if (eat) activities <- c(activities, "Eating")
  if (forage) activities <- c(activities, "Foraging")
  if (kuk) activities <- c(activities, "'Kuk's")
  if (quaa) activities <- c(activities, "'Quaa's")
  if (tailflag) activities <- c(activities, "Tail flags")
  
  if (length(activities) == 0) { # message for when each column is false
    return("Doing nothing")
  } else {
    return(paste(activities, collapse = ", "))
  }
}

squirrel <- squirrel |>
  rowwise() |> # applies this to rows
  mutate(activities = list_activities(running, chasing, climbing, eating, foraging, kuks, quaas, tailflags)) |> # no columns are actually called in the function itself. this is where the columns are called, written in the order I want the function to apply to them
  ungroup() 

squircensus <- left_join(hectare, squirrel) # join the two datasets
Joining with `by = join_by(hectare, shift)`
# thank you Hazel for helping :)

Statistical analysis

squirlm <- lm(temperature ~ numberofsquirrels + totaltimeofsighting + numberofsighters, data = squircensus)

autoplot(squirlm)

The first plot (residuals vs fitted) has a horizontal line with the residual points scattered around it, which indicates that a linear model is appropriate for this data. The normal Q-Q plot shows that the residuals are relatively normally distributed but there are a decent amount of outliers.

cor(squircensus$numberofsquirrels, squircensus$temperature)
[1] 0.1649474
# correlation is 0.16, little to no correlation

summary(lm(temperature ~ numberofsquirrels, data = squircensus))

Call:
lm(formula = temperature ~ numberofsquirrels, data = squircensus)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.9433  -7.6650  -0.8873   7.5574  23.8903 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       58.94327    0.37022 159.213   <2e-16 ***
numberofsquirrels  0.38881    0.04457   8.724   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.868 on 2721 degrees of freedom
Multiple R-squared:  0.02721,   Adjusted R-squared:  0.02685 
F-statistic:  76.1 on 1 and 2721 DF,  p-value: < 2.2e-16

The p-value is <2e-16, meaning there is very high statistical significance, but the multiple R-squared value is tiny, only 0.027. The equation for the linear model is temperature = 58.9 + 0.39(numberofsquirrels), meaning that for every squirrel the temperature will increase by 0.39 degrees — but it only accounts for 2.7% of the variation so it isn’t an effective way to calculate the temperature.

Visualizations

#1: Dotplot

hectare |>
  ggplot(aes(x = temperature, y = numberofsquirrels, color = totaltimeofsighting)) +
  scale_color_gradient(low = "#0040BF", high = "#FAAE7B", aesthetics = "color") +
  geom_point(size = 3, alpha = 0.3) +
  labs(x = "Temperature (°F)",
       y = "Number of Squirrels Seen",
       title = "Temperature by Squirrel Numbers",
       caption = "Source: The Squirrel Census",
       color = "Length of search (mins)") +
  theme_light(base_size = 12, base_family = "serif")

My hypothesis that lower temperatures mean less squirrel sightings is supported by this plot, as there’s clearly more data on squirrels and more squirrels sighted between 50-80° than 40°. However, there’s still not a lot of data on a range of temperatures. The most concentration of data falls between 50-70°, with the highest numbers of squirrels seen falling between 60° and 70°. The two outliers also align with my hypothesis that lower temperature brings lower squirrel counts; a temperature of about 65°F brought the highest squirrel count (over 20 squirrels), and the lowest temperature, about 30°, had a count of 0 squirrels.

#2: Density plot

I made a boxplot at first, but it wasn’t very insightful (and was ugly) so I made a density plot instead.

# subset
tempsquirrel <- squircensus |>
  select(temperature, numberofsquirrels, primaryfurcolor) |>
  arrange(primaryfurcolor)

# subset further
greysquirrel <- tempsquirrel |> filter(primaryfurcolor == "Gray")
blacksquirrel <- tempsquirrel |> filter(primaryfurcolor == "Black")
cinnamonsquirrel <- tempsquirrel |> filter(primaryfurcolor == "Cinnamon")

hchart(
  # first density plot (temperature)
  density(tempsquirrel$temperature), type = "area",
  color = "#ad755a",
  name = "Temperature distribution") |>
  # second density plot (black squirrels)
    hc_add_series(
    density(blacksquirrel$temperature), type = "area",
    color = "#000", 
    name = "Black Squirrel") |>
  # third density plot (grey squirrels)
    hc_add_series(
    density(greysquirrel$temperature), type = "area",
    color = "#c5bdc9", 
    name = "Grey Squirrel") |>
  # fourth density plot (cinnamon squirrels)
      hc_add_series(
  density(cinnamonsquirrel$temperature), 
  type = "area", name = "Cinnamon Squirrel", color = "#7f3300") |>
  # text features
  hc_title(text="Squirrel Activity in Different Temperatures by Color",
           margin = 30,
    align = "center",
    style = list(color = "#4c2918") ) |>
  hc_subtitle(text="Source: The Squirrel Census",
              style = list(color = "#594135")) |>
  hc_xAxis(title = list(text="Temperature (°F)",
                        margin = 5,
                        style = list(color = "#594135"))) 
# source for plot style: Datanovia

This graph is showing us a very slight correlation between temperature and which colors of squirrels were found. The data was taken over two weeks in October with mostly warm temperatures, the lowest 30°F and the highest 84°F, which is a wide range. Black squirrels are seen at an especially high frequency at the high 50s temperature range compared to the other colors, and peak at 58°F. There’s a dip in data taken during 74°F, but cinnamon squirrels are much less affected by this than the other squirrels. All of the squirrels are multimodal and have high variability, but the black squirrels have the highest peak. Additionally, the shape of the grey squirrel curve lines up very closely to the temperature curve. Grey squirrels make up the largest part of this dataset and are found across the entire park in large numbers across all days of the census.

I think the reason why there’s a correlation between temperature and number of squirrels spotted is due to an abundance of data for a certain temperature range. I think it’s possible that the temperature correlates somewhat more to human activity than squirrel activity. Squirrels have no choice but to be outdoors, humans can decide when they want to go out and look at them. That said, I have no basis for this speculation because I haven’t been able to find information on the exact methodology of the census.

#3: GIS squirrel visualization using Leaflet

# color variable, to color the squirrel markers
behaviorcolor <- squirrel %>% 
  mutate(color = case_when(str_detect(primaryfurcolor, "Gray") ~ "#c5bdc9",
                           str_detect(primaryfurcolor, "Black") ~ "#000000",
                           str_detect(primaryfurcolor, "Cinnamon") ~ "#7f3300",
                           str_detect(primaryfurcolor, "NA") ~ "#fc2393",
                           TRUE ~ "a default",))
# source: jlacko on Posit

popupsq <- paste0(
      "<b>Age: </b>", squirrel$age, "<br>",
      "<b>Highlight Color: </b>", squirrel$highlightfurcolor, "<br>",
      "<b>Location: </b>", squirrel$location, "<br>",
      "<b>Activity: </b>", squirrel$activities, "</br>",
      "<b>Notes: </b>", squirrel$otheractivities, "</br>")

leaflet() |>
  setView(lng = -73.968285, lat = 40.785091, zoom =15) |>
  addProviderTiles("Esri.WorldImagery") |>
  addCircles(data = squirrel, radius = 4, color = ~behaviorcolor$color, opacity = 0.75, popup = popupsq) |>
  addLegend("bottomright",
            colors = c("#c5bdc9", "#7f3300", "#000000"),
            labels = c("Grey", "Cinnamon", "Black"),
            title = "Squirrel primary color",
            opacity = 0.9)
Assuming "long" and "lat" are longitude and latitude, respectively
# source for custom legend: bathyscapher on Stack Overflow

I mostly made this plot for fun, but also to visualize the physical distribution of the squirrels, their colors and what they’re doing. A lot of the volunteers added their own notes about what the squirrels were doing and I wanted to include that in a visualization to give more context to the squirrels and their behavior. Two interesting things I noticed is that there’s a whole clustered area of black squirrels and few squirrel sightings not in the boundaries of Central Park. Additionally, some of the squirrels are playing with each other, one squirrel is “fighting with pigeons”, and a group of squirrels are all being harassed by the same two dogs. Unsurprisingly, there’s more squirrel sightings in areas with a lot of trees, and very few along streets or large sidewalks.

There are some very obvious flaws with this dataset. The main and most obvious flaw is that the same squirrels have definitely been logged multiple times. The squirrels in the dataset are just squirrel sightings without any formal physical tagging (The Squirrel Census, 2022) and the census took place over two weeks. This is why the amount of categorical variables is so relevant; there’s so many other factors to look at, like the behaviors the squirrels are exhibiting, the behaviors they’re exhibiting in proximity to others, the temperature they’re most likely to be found in, etc. However, I wasn’t able to find any strong correlations between categorical variables my exploration, but this is partially because of lack of experience with R. There’s some variables I didn’t know how to call on, like physical location or proximity between squirrels, and I’m curious if there’s correlation between the squirrels’ behaviors and how close they are to other squirrels. Overall, I love this dataset and feel motivated to spend a lot more time working with it. I also want to find a dataset of squirrels or other animals with this many variables, but that’s conducted by professionals who actually physically tagged the animals to prevent duplicates. There’s a lot of things I wanted to do with this that I didn’t have time to do for this project; I want to try and see if there’s correlation between squirrel behavior and their proximity to other squirrels, however that’s done; I want to find out how to add a drop-down menu to the leaflet map so you can filter for only squirrels with custom notes instead of making a second map for it; I also really want to get better at using Highcharter because it’s been such a struggle to use.

References

Allen, J. (2022, October 11). Getting to Know Central Park’s Squirrels. Central Park Conservancy. https://www.centralparknyc.org/articles/getting-to-know-central-parks-squirrels

The Squirrel Census. (2022). Data. The Squirrel Census. https://www.thesquirrelcensus.com/data

For code:

Statology (https://www.statology.org/rename-single-column-in-r/) for renaming columns to work in leaflet

Reddit (u/ksalot) (https://www.reddit.com/r/rstats/comments/p3gi8/help_needed_populating_blanks_with_nas/) when struggling to convert a column to numeric

Posit (jlacko) (https://forum.posit.co/t/leaflet-in-r-how-to-color-sites-from-parameters-that-may-contain-a-list/69103/4) to create custom colors for values in Leaflet

Stack Overflow (bathyscapher) (https://stackoverflow.com/questions/38161073/manually-adding-legend-values-in-leaflet) creating custom legend in Leaflet

Datanovia (https://www.datanovia.com/en/lessons/highchart-interactive-density-and-histogram-plots-in-r/) for stacked density plot

Chatgpt, used to correct and explain the function for the “activities” column:

OpenAI. (2024). ChatGPT (May 2024 version) [Large language model]. https://chat.openai.com/chat