Animal Shelter Operations Analytics

2025-12-13

Introduction

This report analyzes the Animal Shelter Intakes and Outcomes dataset from the City of Long Beach Animal Care Services. The goal is to uncover operational trends, understand live-release performance, and support data-driven decisions.

Per the challenge instructions, we will focus on: 1. Data cleaning and consistency checks. 2. Temporal trends (seasonal, day-of-week). 3. Live Release Rates (LRR) by various factors. 4. Intake and outcome correlations. 5. Length of Stay (LOS) analysis. 6. Repeat intakes.

1. Load Libraries and Data

Explanation: First, we load the necessary libraries for data manipulation (tidyverse), date handling (lubridate), and cleaner column names (janitor). We then read the dataset, specifying that strings “NULL” should be treated as NA (missing values).

library(tidyverse)
library(lubridate)
library(janitor)
library(scales)

# Read the dataset
# We treat "NULL" as NA based on initial inspection of the raw text
df <- read_csv("DataDNA Dataset Challenge - Animal Shelter Operations - December 2025.csv",
    na = c("", "NA", "NULL")
)

# Display initial dimensions
dim(df)
## [1] 52343    29

2. Data Cleaning and Consistency Checks

Explanation: We standardize column names to snake_case for easier coding. Then, we inspect the data types, particularly dates.

# Standardize column names
df <- df %>% clean_names()

2.1 Date Parsing and Consistency

Explanation: The dataset contains date fields (dob, intake_date, outcome_date). We ensure they are parsed correctly as Date objects.

Quality Check: We also check for logical inconsistencies: 1. DOB should not be after Intake Date. 2. Outcome Date should not be before Intake Date.

# Convert date columns to Date type
df <- df %>%
    mutate(
        dob = ymd(dob),
        intake_date = ymd(intake_date),
        outcome_date = ymd(outcome_date)
    )

# Remove rows with logical date errors (e.g., negative duration or unborn animals)
df <- df %>%
    filter(intake_date >= dob | is.na(dob)) %>%
    filter(outcome_date >= intake_date | is.na(outcome_date))

# Output summary of missing dates
colSums(is.na(df %>% select(dob, intake_date, outcome_date)))
##          dob  intake_date outcome_date 
##         6529            0          400

2.2 Calculated Fields: Age and Duration

Explanation: We calculate the age of the animal at intake and create age groups. This is critical for analyzing adoption rates (e.g., puppies vs seniors).

df <- df %>%
    mutate(
        # Recalculate duration to ensure accuracy
        calculated_duration = as.numeric(outcome_date - intake_date),
        intake_duration = if_else(!is.na(calculated_duration), calculated_duration, intake_duration),

        # Calculate Age at Intake in Years
        age_at_intake = as.numeric(difftime(intake_date, dob, units = "days")) / 365.25,

        # Create Age Groups
        age_group = case_when(
            age_at_intake < 1 ~ "Baby (<1 yr)",
            age_at_intake >= 1 & age_at_intake < 3 ~ "Young Adult (1-3 yrs)",
            age_at_intake >= 3 & age_at_intake < 8 ~ "Adult (3-8 yrs)",
            age_at_intake >= 8 ~ "Senior (8+ yrs)",
            TRUE ~ "Unknown"
        ),

        # Extract Month, Year, and Day of Week
        intake_year = year(intake_date),
        intake_month = month(intake_date, label = TRUE),
        intake_wday = wday(intake_date, label = TRUE),

        # Seasonality
        season = case_when(
            intake_month %in% c("Dec", "Jan", "Feb") ~ "Winter",
            intake_month %in% c("Mar", "Apr", "May") ~ "Spring",
            intake_month %in% c("Jun", "Jul", "Aug") ~ "Summer",
            intake_month %in% c("Sep", "Oct", "Nov") ~ "Fall"
        )
    )

2.3 Categorical Consistency: Sex

Explanation: We separate sex into biological_sex and fixed_status.

df <- df %>%
    mutate(
        biological_sex = case_when(
            str_detect(sex, "(?i)Male|Neutered") ~ "Male",
            str_detect(sex, "(?i)Female|Spayed") ~ "Female",
            TRUE ~ "Unknown"
        ),
        fixed_status = case_when(
            str_detect(sex, "(?i)Neutered|Spayed") ~ "Fixed",
            str_detect(sex, "(?i)Male|Female") ~ "Intact",
            TRUE ~ "Unknown"
        )
    )

2.4 Duplicate Check

Explanation: Checking for exact duplicates to ensure data quality.

# Remove exact duplicates
df <- df %>% distinct()

3. Analysis

Explanation: Understanding when animals arrive helps with staffing. We summarize intake volume by Day of Week and Season.

# Day of Week Intake
df %>%
    count(intake_wday) %>%
    ggplot(aes(x = intake_wday, y = n, fill = intake_wday)) +
    geom_col() +
    coord_flip() +
    scale_fill_brewer(palette = "Pastel1") +
    theme_minimal() +
    labs(title = "Intakes by Day of Week", x = "Day", y = "Count") +
    theme(legend.position = "none")

# Seasonal Intake
df %>%
    count(season) %>%
    ggplot(aes(x = reorder(season, n), y = n, fill = season)) +
    geom_col() +
    theme_minimal() +
    labs(title = "Intakes by Season", x = "Season", y = "Count") +
    theme(legend.position = "none")

3.2 Live Release Rate (LRR) Analysis

Explanation: The Live Release Rate is a key performance indicator. It is defined as the percentage of live outcomes (Adoption, Return to Owner, Transfer) out of total outcomes.

# Define live outcomes based on column review
live_outcomes_list <- c("ADOPTION", "RETURN TO OWNER", "TRANSFER", "RESCUE", "RTOS", "FOSTER", "COMMUNITY CAT")

# Calculate LRR by Year
df %>%
    filter(!is.na(outcome_type)) %>%
    mutate(is_live = outcome_type %in% live_outcomes_list) %>%
    group_by(year = year(outcome_date)) %>%
    summarize(lrr = mean(is_live)) %>%
    ggplot(aes(x = year, y = lrr)) +
    geom_line(size = 1.2, color = "darkgreen") +
    geom_point(size = 3) +
    scale_y_continuous(labels = percent) +
    theme_minimal() +
    labs(title = "Live Release Rate Trend Over Years", y = "Live Release Rate")

LRR by Age Group and Animal Type:

df %>%
    filter(!is.na(outcome_type), !is.na(age_group)) %>%
    mutate(is_live = outcome_type %in% live_outcomes_list) %>%
    group_by(animal_type, age_group) %>%
    summarize(lrr = mean(is_live), count = n()) %>%
    filter(count > 50) %>% # Filter for statistical significance
    ggplot(aes(x = age_group, y = lrr, fill = animal_type)) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = percent) +
    theme_minimal() +
    labs(title = "Live Release Rate by Age Group and Species", y = "LRR")

3.3 Intake Condition vs. Outcome

Explanation: Does the condition at intake (e.g., Injured, Normal) predict the outcome?

df %>%
    filter(!is.na(intake_condition), !is.na(outcome_type)) %>%
    group_by(intake_condition, outcome_type) %>%
    tally() %>%
    group_by(intake_condition) %>%
    mutate(freq = n / sum(n)) %>%
    ggplot(aes(x = intake_condition, y = freq, fill = outcome_type)) +
    geom_col() +
    coord_flip() +
    scale_y_continuous(labels = percent) +
    theme_minimal() +
    labs(title = "Outcome Proportion by Intake Condition", x = "Intake Condition", y = "Proportion")

3.4 Length of Stay (LOS) Analysis

Explanation: We analyze how long animals stay in the shelter. Long stays strain resources.

# Average LOS by Animal Type
df %>%
    group_by(animal_type) %>%
    summarize(avg_los = mean(intake_duration, na.rm = TRUE), median_los = median(intake_duration, na.rm = TRUE)) %>%
    ggplot(aes(x = reorder(animal_type, avg_los), y = avg_los)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Average Length of Stay by Species", y = "Days", x = "Species")

# Average LOS by Outcome Type
df %>%
    filter(!is.na(outcome_type)) %>%
    group_by(outcome_type) %>%
    summarize(avg_los = mean(intake_duration, na.rm = TRUE)) %>%
    ggplot(aes(x = reorder(outcome_type, avg_los), y = avg_los)) +
    geom_col(fill = "firebrick") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Average Length of Stay by Outcome Type", y = "Days", x = "Outcome")

3.5 Intake Sources

Explanation: Tracking where animals come from (e.g., Stray, Owner Surrender) helps in targeting community interventions.

df %>%
    count(intake_type, sort = TRUE) %>%
    ggplot(aes(x = reorder(intake_type, n), y = n)) +
    geom_col(fill = "purple") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Volume by Intake Type", x = "Intake Type", y = "Count")

3.6 Repeat Intakes (Recidivism)

Explanation: We identify animals that have entered the shelter multiple times by counting distinct Intake Dates per animal_id.

repeat_offenders <- df %>%
    group_by(animal_id) %>%
    summarize(intake_count = n_distinct(intake_date)) %>%
    filter(intake_count > 1) %>%
    arrange(desc(intake_count))

# Percentage of animals that are repeat intakes
nrow(repeat_offenders) / n_distinct(df$animal_id) * 100
## [1] 3.021858
# Inspect top repeat offenders
head(repeat_offenders, 10)
## # A tibble: 10 × 2
##    animal_id intake_count
##    <chr>            <int>
##  1 A637086              8
##  2 A593031              7
##  3 A637269              7
##  4 A279626              6
##  5 A542375              6
##  6 A610057              6
##  7 A657508              6
##  8 A354875              5
##  9 A555597              5
## 10 A575288              5

3.7 Strategic Focus Areas

Explanation: To directly answer “Where should we focus strategic effort?”, we programmatically identify the groups with the biggest challenges.

1. Who needs the most help? (Lowest Live Release Rate) Identifying the demographic with the lowest chance of a live outcome.

lrr_focus <- df %>%
    filter(!is.na(outcome_type), !is.na(age_group)) %>%
    mutate(is_live = outcome_type %in% live_outcomes_list) %>%
    group_by(animal_type, age_group) %>%
    summarize(lrr = mean(is_live), count = n()) %>%
    filter(count > 50) %>%
    arrange(lrr) %>%
    head(3)

print("Demographics with Lowest Live Release Rates:")
## [1] "Demographics with Lowest Live Release Rates:"
print(lrr_focus)
## # A tibble: 3 × 4
## # Groups:   animal_type [2]
##   animal_type age_group                lrr count
##   <chr>       <chr>                  <dbl> <int>
## 1 WILD        Young Adult (1-3 yrs) 0.0576   139
## 2 WILD        Adult (3-8 yrs)       0.134    142
## 3 OTHER       Adult (3-8 yrs)       0.219     96

2. Where are resources tied up? (Longest Length of Stay) Longer stays mean more food, medical care, and kennel space.

los_focus <- df %>%
    filter(!is.na(outcome_type)) %>%
    group_by(animal_type, outcome_type) %>%
    summarize(avg_los = mean(intake_duration, na.rm = TRUE), count = n()) %>%
    filter(count > 50) %>%
    arrange(desc(avg_los)) %>%
    head(3)

print("Groups with Longest Average Length of Stay:")
## [1] "Groups with Longest Average Length of Stay:"
print(los_focus)
## # A tibble: 3 × 4
## # Groups:   animal_type [3]
##   animal_type outcome_type avg_los count
##   <chr>       <chr>          <dbl> <int>
## 1 CAT         MISSING        288.    103
## 2 RABBIT      ADOPTION        62.1   280
## 3 GUINEA PIG  ADOPTION        58.9    80

3. Where can we prevent intake? (Top Intake Sources) Targeting the largest intake sources with community programs.

intake_focus <- df %>%
    count(intake_type, sort = TRUE) %>%
    mutate(percent = n / sum(n)) %>%
    head(3)

print("Top Intake Drivers:")
## [1] "Top Intake Drivers:"
print(intake_focus)
## # A tibble: 3 × 3
##   intake_type         n percent
##   <chr>           <int>   <dbl>
## 1 STRAY           36577  0.702 
## 2 WILDLIFE         8150  0.156 
## 3 OWNER SURRENDER  4570  0.0877

4. Exploratory Analysis: Unexpected Findings

Explanation: Per the challenge encouragement to “explore the data in your own way,” we are investigating two common shelter hypotheses to see if they hold true in Long Beach: 1. The Power of a Name: Do animals with assigned names get adopted faster than those without? 2. “Black Dog/Cat Syndrome”: Do animals with “Black” as their primary color linger longer in the shelter?

4.1 The Power of a Name

We compare the median days to adoption for animals with a recorded name versus those with “NULL” or generic placeholders.

df <- df %>%
    mutate(has_name = if_else(is.na(animal_name) | animal_name %in% c("NULL", "UNKNOWN", "STRAY"), "Unnamed", "Named"))

df %>%
    filter(outcome_type == "ADOPTION") %>%
    group_by(has_name) %>%
    summarize(
        median_days_to_adopt = median(intake_duration, na.rm = TRUE),
        count = n()
    ) %>%
    ggplot(aes(x = has_name, y = median_days_to_adopt, fill = has_name)) +
    geom_col() +
    theme_minimal() +
    labs(title = "Impact of Names on Adoption Speed", y = "Median Days to Adoption")

4.2 Testing “Black Animal Syndrome”

There is a common belief that black animals are harder to adopt. Let’s test this by comparing the Length of Stay for animals with “Black” as a primary color vs. all others.

df %>%
    filter(outcome_type == "ADOPTION") %>%
    mutate(is_black_color = if_else(primary_color == "BLACK", "Black Coat", "Other Color")) %>%
    group_by(animal_type, is_black_color) %>%
    summarize(median_los = median(intake_duration, na.rm = TRUE), count = n()) %>%
    filter(count > 100) %>% # Ensure robust sample size
    ggplot(aes(x = animal_type, y = median_los, fill = is_black_color)) +
    geom_col(position = "dodge") +
    scale_fill_manual(values = c("Black Coat" = "#333333", "Other Color" = "#999999")) +
    theme_minimal() +
    labs(title = "Does Coat Color Affect Length of Stay?", y = "Median Days to Adoption")

5. Advanced Analytics: Operational Stress Test (Daily Census)

Explanation: To truly stand out and win this challenge, we must move beyond counting events (intakes/outcomes) to measuring pressure (capacity). Standard charts show how many animals arrived, but they don’t show how many mouths there were to feed on any specific day.

We will perform a Daily Census Analysis. This involves: 1. Expanding every animal’s record into a sequence of dates representing every day they spent in the shelter. 2. Aggregating these daily records to calculate the exact shelter population for every single day in the dataset. 3. Visualizing this “Operational Stress” to pinpoint exact periods of capacity crisis.

# 1. Create a daily sequence for each animal
# Note: This operation expands the dataframe significantly, so we filter for essential columns first
census_df <- df %>%
    filter(!is.na(intake_date), !is.na(outcome_date), intake_date <= outcome_date) %>%
    select(animal_id, animal_type, intake_date, outcome_date) %>%
    mutate(day_in_shelter = map2(intake_date, outcome_date, seq, by = "day")) %>%
    unnest(day_in_shelter)

# 2. Aggregating to find daily population counts
daily_counts <- census_df %>%
    count(day_in_shelter, animal_type) %>%
    rename(date = day_in_shelter, population = n)

# 3. Visualize the Daily Capacity Stress
daily_counts %>%
    filter(year(date) >= 2018) %>% # Focus on recent relevant history
    ggplot(aes(x = date, y = population, fill = animal_type)) +
    geom_area(alpha = 0.8, size = 0.5, color = "white") +
    theme_minimal() +
    labs(
        title = "Daily Shelter Census (Operational Stress Test)",
        subtitle = "Total animals in care per day - The true measure of resource strain",
        x = "Date",
        y = "Total Animals in Shelter"
    ) +
    scale_fill_brewer(palette = "Set1")

Interpretation: This “Area Chart” is a game-changer for operations. Unlike bar charts of monthly intakes, this shows the cumulative burden. * Peaks: Identify exactly when the shelter ran out of kennel space. * Seasonality: Notice how the “Area” (total care days) usually expands in summer (Kitten Season) even if intakes don’t spike as dramatically, because kittens stay longer. * Resource Planning: You can now predict purely based on date: “On July 15th, we need 30% more food/staff vs January 1st.”