2025-12-13
This report analyzes the Animal Shelter Intakes and Outcomes dataset from the City of Long Beach Animal Care Services. The goal is to uncover operational trends, understand live-release performance, and support data-driven decisions.
Per the challenge instructions, we will focus on: 1. Data cleaning and consistency checks. 2. Temporal trends (seasonal, day-of-week). 3. Live Release Rates (LRR) by various factors. 4. Intake and outcome correlations. 5. Length of Stay (LOS) analysis. 6. Repeat intakes.
Explanation: First, we load the necessary libraries
for data manipulation (tidyverse), date handling
(lubridate), and cleaner column names
(janitor). We then read the dataset, specifying that
strings “NULL” should be treated as NA (missing
values).
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
# Read the dataset
# We treat "NULL" as NA based on initial inspection of the raw text
df <- read_csv("DataDNA Dataset Challenge - Animal Shelter Operations - December 2025.csv",
na = c("", "NA", "NULL")
)
# Display initial dimensions
dim(df)## [1] 52343 29
Explanation: We standardize column names to snake_case for easier coding. Then, we inspect the data types, particularly dates.
Explanation: The dataset contains date fields
(dob, intake_date, outcome_date).
We ensure they are parsed correctly as Date objects.
Quality Check: We also check for logical
inconsistencies: 1. DOB should not be after
Intake Date. 2. Outcome Date should not be
before Intake Date.
# Convert date columns to Date type
df <- df %>%
mutate(
dob = ymd(dob),
intake_date = ymd(intake_date),
outcome_date = ymd(outcome_date)
)
# Remove rows with logical date errors (e.g., negative duration or unborn animals)
df <- df %>%
filter(intake_date >= dob | is.na(dob)) %>%
filter(outcome_date >= intake_date | is.na(outcome_date))
# Output summary of missing dates
colSums(is.na(df %>% select(dob, intake_date, outcome_date)))## dob intake_date outcome_date
## 6529 0 400
Explanation: We calculate the age of the animal at intake and create age groups. This is critical for analyzing adoption rates (e.g., puppies vs seniors).
df <- df %>%
mutate(
# Recalculate duration to ensure accuracy
calculated_duration = as.numeric(outcome_date - intake_date),
intake_duration = if_else(!is.na(calculated_duration), calculated_duration, intake_duration),
# Calculate Age at Intake in Years
age_at_intake = as.numeric(difftime(intake_date, dob, units = "days")) / 365.25,
# Create Age Groups
age_group = case_when(
age_at_intake < 1 ~ "Baby (<1 yr)",
age_at_intake >= 1 & age_at_intake < 3 ~ "Young Adult (1-3 yrs)",
age_at_intake >= 3 & age_at_intake < 8 ~ "Adult (3-8 yrs)",
age_at_intake >= 8 ~ "Senior (8+ yrs)",
TRUE ~ "Unknown"
),
# Extract Month, Year, and Day of Week
intake_year = year(intake_date),
intake_month = month(intake_date, label = TRUE),
intake_wday = wday(intake_date, label = TRUE),
# Seasonality
season = case_when(
intake_month %in% c("Dec", "Jan", "Feb") ~ "Winter",
intake_month %in% c("Mar", "Apr", "May") ~ "Spring",
intake_month %in% c("Jun", "Jul", "Aug") ~ "Summer",
intake_month %in% c("Sep", "Oct", "Nov") ~ "Fall"
)
)Explanation: We separate sex into
biological_sex and fixed_status.
df <- df %>%
mutate(
biological_sex = case_when(
str_detect(sex, "(?i)Male|Neutered") ~ "Male",
str_detect(sex, "(?i)Female|Spayed") ~ "Female",
TRUE ~ "Unknown"
),
fixed_status = case_when(
str_detect(sex, "(?i)Neutered|Spayed") ~ "Fixed",
str_detect(sex, "(?i)Male|Female") ~ "Intact",
TRUE ~ "Unknown"
)
)Explanation: Checking for exact duplicates to ensure data quality.
Explanation: Understanding when animals arrive helps with staffing. We summarize intake volume by Day of Week and Season.
# Day of Week Intake
df %>%
count(intake_wday) %>%
ggplot(aes(x = intake_wday, y = n, fill = intake_wday)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Pastel1") +
theme_minimal() +
labs(title = "Intakes by Day of Week", x = "Day", y = "Count") +
theme(legend.position = "none")# Seasonal Intake
df %>%
count(season) %>%
ggplot(aes(x = reorder(season, n), y = n, fill = season)) +
geom_col() +
theme_minimal() +
labs(title = "Intakes by Season", x = "Season", y = "Count") +
theme(legend.position = "none")Explanation: The Live Release Rate is a key performance indicator. It is defined as the percentage of live outcomes (Adoption, Return to Owner, Transfer) out of total outcomes.
# Define live outcomes based on column review
live_outcomes_list <- c("ADOPTION", "RETURN TO OWNER", "TRANSFER", "RESCUE", "RTOS", "FOSTER", "COMMUNITY CAT")
# Calculate LRR by Year
df %>%
filter(!is.na(outcome_type)) %>%
mutate(is_live = outcome_type %in% live_outcomes_list) %>%
group_by(year = year(outcome_date)) %>%
summarize(lrr = mean(is_live)) %>%
ggplot(aes(x = year, y = lrr)) +
geom_line(size = 1.2, color = "darkgreen") +
geom_point(size = 3) +
scale_y_continuous(labels = percent) +
theme_minimal() +
labs(title = "Live Release Rate Trend Over Years", y = "Live Release Rate")LRR by Age Group and Animal Type:
df %>%
filter(!is.na(outcome_type), !is.na(age_group)) %>%
mutate(is_live = outcome_type %in% live_outcomes_list) %>%
group_by(animal_type, age_group) %>%
summarize(lrr = mean(is_live), count = n()) %>%
filter(count > 50) %>% # Filter for statistical significance
ggplot(aes(x = age_group, y = lrr, fill = animal_type)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = percent) +
theme_minimal() +
labs(title = "Live Release Rate by Age Group and Species", y = "LRR")Explanation: Does the condition at intake (e.g., Injured, Normal) predict the outcome?
df %>%
filter(!is.na(intake_condition), !is.na(outcome_type)) %>%
group_by(intake_condition, outcome_type) %>%
tally() %>%
group_by(intake_condition) %>%
mutate(freq = n / sum(n)) %>%
ggplot(aes(x = intake_condition, y = freq, fill = outcome_type)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = percent) +
theme_minimal() +
labs(title = "Outcome Proportion by Intake Condition", x = "Intake Condition", y = "Proportion")Explanation: We analyze how long animals stay in the shelter. Long stays strain resources.
# Average LOS by Animal Type
df %>%
group_by(animal_type) %>%
summarize(avg_los = mean(intake_duration, na.rm = TRUE), median_los = median(intake_duration, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(animal_type, avg_los), y = avg_los)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Average Length of Stay by Species", y = "Days", x = "Species")# Average LOS by Outcome Type
df %>%
filter(!is.na(outcome_type)) %>%
group_by(outcome_type) %>%
summarize(avg_los = mean(intake_duration, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(outcome_type, avg_los), y = avg_los)) +
geom_col(fill = "firebrick") +
coord_flip() +
theme_minimal() +
labs(title = "Average Length of Stay by Outcome Type", y = "Days", x = "Outcome")Explanation: Tracking where animals come from (e.g., Stray, Owner Surrender) helps in targeting community interventions.
df %>%
count(intake_type, sort = TRUE) %>%
ggplot(aes(x = reorder(intake_type, n), y = n)) +
geom_col(fill = "purple") +
coord_flip() +
theme_minimal() +
labs(title = "Volume by Intake Type", x = "Intake Type", y = "Count")Explanation: We identify animals that have entered
the shelter multiple times by counting distinct Intake Dates per
animal_id.
repeat_offenders <- df %>%
group_by(animal_id) %>%
summarize(intake_count = n_distinct(intake_date)) %>%
filter(intake_count > 1) %>%
arrange(desc(intake_count))
# Percentage of animals that are repeat intakes
nrow(repeat_offenders) / n_distinct(df$animal_id) * 100## [1] 3.021858
## # A tibble: 10 × 2
## animal_id intake_count
## <chr> <int>
## 1 A637086 8
## 2 A593031 7
## 3 A637269 7
## 4 A279626 6
## 5 A542375 6
## 6 A610057 6
## 7 A657508 6
## 8 A354875 5
## 9 A555597 5
## 10 A575288 5
Explanation: To directly answer “Where should we focus strategic effort?”, we programmatically identify the groups with the biggest challenges.
1. Who needs the most help? (Lowest Live Release Rate) Identifying the demographic with the lowest chance of a live outcome.
lrr_focus <- df %>%
filter(!is.na(outcome_type), !is.na(age_group)) %>%
mutate(is_live = outcome_type %in% live_outcomes_list) %>%
group_by(animal_type, age_group) %>%
summarize(lrr = mean(is_live), count = n()) %>%
filter(count > 50) %>%
arrange(lrr) %>%
head(3)
print("Demographics with Lowest Live Release Rates:")## [1] "Demographics with Lowest Live Release Rates:"
## # A tibble: 3 × 4
## # Groups: animal_type [2]
## animal_type age_group lrr count
## <chr> <chr> <dbl> <int>
## 1 WILD Young Adult (1-3 yrs) 0.0576 139
## 2 WILD Adult (3-8 yrs) 0.134 142
## 3 OTHER Adult (3-8 yrs) 0.219 96
2. Where are resources tied up? (Longest Length of Stay) Longer stays mean more food, medical care, and kennel space.
los_focus <- df %>%
filter(!is.na(outcome_type)) %>%
group_by(animal_type, outcome_type) %>%
summarize(avg_los = mean(intake_duration, na.rm = TRUE), count = n()) %>%
filter(count > 50) %>%
arrange(desc(avg_los)) %>%
head(3)
print("Groups with Longest Average Length of Stay:")## [1] "Groups with Longest Average Length of Stay:"
## # A tibble: 3 × 4
## # Groups: animal_type [3]
## animal_type outcome_type avg_los count
## <chr> <chr> <dbl> <int>
## 1 CAT MISSING 288. 103
## 2 RABBIT ADOPTION 62.1 280
## 3 GUINEA PIG ADOPTION 58.9 80
3. Where can we prevent intake? (Top Intake Sources) Targeting the largest intake sources with community programs.
intake_focus <- df %>%
count(intake_type, sort = TRUE) %>%
mutate(percent = n / sum(n)) %>%
head(3)
print("Top Intake Drivers:")## [1] "Top Intake Drivers:"
## # A tibble: 3 × 3
## intake_type n percent
## <chr> <int> <dbl>
## 1 STRAY 36577 0.702
## 2 WILDLIFE 8150 0.156
## 3 OWNER SURRENDER 4570 0.0877
Explanation: Per the challenge encouragement to “explore the data in your own way,” we are investigating two common shelter hypotheses to see if they hold true in Long Beach: 1. The Power of a Name: Do animals with assigned names get adopted faster than those without? 2. “Black Dog/Cat Syndrome”: Do animals with “Black” as their primary color linger longer in the shelter?
We compare the median days to adoption for animals with a recorded name versus those with “NULL” or generic placeholders.
df <- df %>%
mutate(has_name = if_else(is.na(animal_name) | animal_name %in% c("NULL", "UNKNOWN", "STRAY"), "Unnamed", "Named"))
df %>%
filter(outcome_type == "ADOPTION") %>%
group_by(has_name) %>%
summarize(
median_days_to_adopt = median(intake_duration, na.rm = TRUE),
count = n()
) %>%
ggplot(aes(x = has_name, y = median_days_to_adopt, fill = has_name)) +
geom_col() +
theme_minimal() +
labs(title = "Impact of Names on Adoption Speed", y = "Median Days to Adoption")There is a common belief that black animals are harder to adopt. Let’s test this by comparing the Length of Stay for animals with “Black” as a primary color vs. all others.
df %>%
filter(outcome_type == "ADOPTION") %>%
mutate(is_black_color = if_else(primary_color == "BLACK", "Black Coat", "Other Color")) %>%
group_by(animal_type, is_black_color) %>%
summarize(median_los = median(intake_duration, na.rm = TRUE), count = n()) %>%
filter(count > 100) %>% # Ensure robust sample size
ggplot(aes(x = animal_type, y = median_los, fill = is_black_color)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Black Coat" = "#333333", "Other Color" = "#999999")) +
theme_minimal() +
labs(title = "Does Coat Color Affect Length of Stay?", y = "Median Days to Adoption")Explanation: To truly stand out and win this challenge, we must move beyond counting events (intakes/outcomes) to measuring pressure (capacity). Standard charts show how many animals arrived, but they don’t show how many mouths there were to feed on any specific day.
We will perform a Daily Census Analysis. This involves: 1. Expanding every animal’s record into a sequence of dates representing every day they spent in the shelter. 2. Aggregating these daily records to calculate the exact shelter population for every single day in the dataset. 3. Visualizing this “Operational Stress” to pinpoint exact periods of capacity crisis.
# 1. Create a daily sequence for each animal
# Note: This operation expands the dataframe significantly, so we filter for essential columns first
census_df <- df %>%
filter(!is.na(intake_date), !is.na(outcome_date), intake_date <= outcome_date) %>%
select(animal_id, animal_type, intake_date, outcome_date) %>%
mutate(day_in_shelter = map2(intake_date, outcome_date, seq, by = "day")) %>%
unnest(day_in_shelter)
# 2. Aggregating to find daily population counts
daily_counts <- census_df %>%
count(day_in_shelter, animal_type) %>%
rename(date = day_in_shelter, population = n)
# 3. Visualize the Daily Capacity Stress
daily_counts %>%
filter(year(date) >= 2018) %>% # Focus on recent relevant history
ggplot(aes(x = date, y = population, fill = animal_type)) +
geom_area(alpha = 0.8, size = 0.5, color = "white") +
theme_minimal() +
labs(
title = "Daily Shelter Census (Operational Stress Test)",
subtitle = "Total animals in care per day - The true measure of resource strain",
x = "Date",
y = "Total Animals in Shelter"
) +
scale_fill_brewer(palette = "Set1")Interpretation: This “Area Chart” is a game-changer for operations. Unlike bar charts of monthly intakes, this shows the cumulative burden. * Peaks: Identify exactly when the shelter ran out of kennel space. * Seasonality: Notice how the “Area” (total care days) usually expands in summer (Kitten Season) even if intakes don’t spike as dramatically, because kittens stay longer. * Resource Planning: You can now predict purely based on date: “On July 15th, we need 30% more food/staff vs January 1st.”