This report uses the Classification Based on Associations (CBA) method to analyze animal intake and outcome data from the Long Beach Animal Shelter. The dataset is available here:
Long Beach Animal Shelter Intakes and Outcomes https://longbeach.opendatasoft.com/explore/dataset/animal-shelter-intakes-and-outcomes/
The goal is to predict an animal’s final outcome (e.g., Adoption, Euthanasia, Transfer) using only information available at the moment of intake. This is treated as an intake-time prediction task, so variables that would only be known later (such as length of stay) are avoided. The analysis focuses on combinations of intake characteristics (age group, condition, intake type) that are linked to higher chances of positive versus negative outcomes.
Key variables used in this analysis include:
Several packages are required: tools for data cleaning and transformation, date handling, and association rule mining/classification. The caret package is used for train/test splitting and evaluation.
library(tidyverse)
library(lubridate)
library(janitor)
library(stringr)
library(arules)
library(arulesViz)
library(arulesCBA)
library(caret)
options(stringsAsFactors = FALSE)
set.seed(123)
The raw file contains common formatting issues (for example “NULL” saved as text, inconsistent capitalization, and extra whitespace). These issues can create artificial categories, which is a problem for association rules because categories are treated as distinct items. Text fields are therefore normalized for selected categorical columns.
Some variables are stored as text-based booleans, so a parsing step is included. Age is also grouped into broad life stages, because exact numeric age would create too many unique values and would weaken rule mining.
normalize_text <- function(x) {
x <- as.character(x)
x <- str_trim(x)
x <- str_replace_all(x, "\\s+", " ")
x[x == ""] <- NA
x[toupper(x) == "NULL"] <- NA
x <- ifelse(is.na(x), NA, str_to_title(x))
x
}
parse_bool <- function(x) {
x <- as.character(x)
x <- str_trim(x)
x <- str_to_lower(x)
case_when(
x %in% c("true", "t", "1") ~ TRUE,
x %in% c("false","f","0") ~ FALSE,
TRUE ~ NA
)
}
create_age_group <- function(age_years) {
case_when(
is.na(age_years) ~ "Unknown",
age_years < 1 ~ "Baby",
age_years < 3 ~ "Young",
age_years < 8 ~ "Adult",
TRUE ~ "Senior"
)
}
A cleaning pipeline is then applied. Text normalization is limited to relevant categorical columns to avoid altering identifiers. Intake month is extracted to capture potential seasonality effects.
df_raw <- read.csv("animal-shelter-intakes-and-outcomes.csv", stringsAsFactors = FALSE)
text_cols <- c("animal_type", "sex", "intake_condition",
"intake_type", "intake_subtype", "outcome_type")
df <- df_raw %>%
clean_names() %>%
mutate(across(all_of(text_cols), normalize_text)) %>%
mutate(
dob = ymd(dob, quiet = TRUE),
intake_date = ymd(intake_date, quiet = TRUE),
outcome_date = ymd(outcome_date, quiet = TRUE)
) %>%
mutate(intake_condition = ifelse(intake_condition == "Ill Moderatete",
"Ill Moderate", intake_condition)) %>%
mutate(
sex_base = case_when(
sex == "Neutered" ~ "Male",
sex == "Spayed" ~ "Female",
sex %in% c("Male","Female","Unknown") ~ sex,
TRUE ~ "Unknown"
),
is_sterilized = case_when(
sex %in% c("Neutered","Spayed") ~ TRUE,
sex %in% c("Male","Female") ~ FALSE,
TRUE ~ NA
)
) %>%
mutate(
age_at_intake_years = as.numeric(intake_date - dob) / 365.25,
age_at_intake_years = ifelse(!is.na(age_at_intake_years) &
(age_at_intake_years < 0 | age_at_intake_years > 30),
NA, age_at_intake_years),
age_group = create_age_group(age_at_intake_years),
intake_month = month(intake_date, label = TRUE, abbr = TRUE)
)
Outcome types are detailed in the raw dataset, so they are grouped into three categories for classification:
Animals still in the shelter are marked separately and excluded later from modeling.
positive <- c("Adoption", "Return To Owner", "Community Cat",
"Return To Wild Habitat", "Homefirst", "Foster To Adopt")
negative <- c("Euthanasia", "Died", "Disposal")
partner <- c("Transfer", "Rescue", "Transport", "Shelter, Neuter, Return")
df <- df %>%
mutate(
outcome_is_current = parse_bool(outcome_is_current),
outcome_group = case_when(
isTRUE(outcome_is_current) | is.na(outcome_type) ~ "No_Outcome_Yet",
outcome_type %in% positive ~ "Positive",
outcome_type %in% negative ~ "Negative",
outcome_type %in% partner ~ "Other_or_Partner",
TRUE ~ "Admin_or_Unknown"
)
)
Only cases with a finalized, classifiable outcome are included. Records without a final outcome (still in shelter) or unclear/admin outcomes are removed to keep the target variable consistent.
n_total <- nrow(df)
df_filtered <- df %>%
filter(outcome_group %in% c("Positive", "Negative", "Other_or_Partner"))
n_kept <- nrow(df_filtered)
n_dropped <- n_total - n_kept
print(paste("Original Cases:", n_total))
## [1] "Original Cases: 52759"
print(paste("Modeled Cases:", n_kept))
## [1] "Modeled Cases: 52108"
print(paste("Excluded:", n_dropped, "(", round(n_dropped/n_total*100, 1), "%)"))
## [1] "Excluded: 651 ( 1.2 %)"
The model’s scope is restricted to animals with a finalized, classifiable outcome. I excluded 1.2% of the dataset consisting of current residents (“No Outcome Yet”) and administrative errors.
A critical decision in this analysis was the exclusion of the intake_duration variable. In a retrospective dataset, the duration of stay is known. However, at the moment of intake, the duration is unknown. Including it would cause data leakage, rendering the prediction useless for triage.
I also removed Jurisdiction because preliminary analysis showed that over 85% of intakes come from Long Beach, making it a near-constant variable that adds noise.
Finally, predictors are converted to factors and missing values are handled by assigning “Unknown” where appropriate.
df_cba <- df_filtered %>%
transmute(
outcome_group = as.factor(outcome_group),
animal_type = as.factor(ifelse(is.na(animal_type), "Unknown", animal_type)),
sex_base = as.factor(ifelse(is.na(sex_base), "Unknown", sex_base)),
is_sterilized = as.factor(ifelse(is.na(is_sterilized), "Unknown", as.character(is_sterilized))),
intake_condition = as.factor(ifelse(is.na(intake_condition), "Unknown", intake_condition)),
intake_type = as.factor(ifelse(is.na(intake_type), "Unknown", intake_type)),
intake_subtype = as.factor(ifelse(is.na(intake_subtype), "Unknown", intake_subtype)),
age_group = as.factor(age_group),
intake_month = as.factor(intake_month)
)
Before modeling, we investigate the structure of the data, specifically focusing on seasonality and item frequency.
Shelters often experience seasonal fluctuations (e.g., kitten season in summer). A Chi-Square test was performed to verify if Intake Month is statistically dependent on Outcome Group.
chisq.test(table(df_cba$intake_month, df_cba$outcome_group))
##
## Pearson's Chi-squared test
##
## data: table(df_cba$intake_month, df_cba$outcome_group)
## X-squared = 377.85, df = 22, p-value < 2.2e-16
A Chi-Square test confirmed a statistically significant relationship between Intake Month and Outcome (p < 0.001).
mosaicplot(
table(df_cba$intake_month, df_cba$outcome_group),
main = "Outcome Distribution by Month",
col = c("salmon", "lightblue", "lightgreen"),
las = 1
)
Looking at the mosaic plot, we can see that Positive outcomes (the light
green sections) seem to shrink slightly during the summer months
compared to winter. This aligns with the idea of kitten season, where
shelters are overcrowded, making adoptions harder relative to the intake
volume.
We split the data 80/20 for training and testing. The training data is converted into transactions for rule mining and a frequency plot is used to see the most common items.
train_idx <- createDataPartition(df_cba$outcome_group, p = 0.80, list = FALSE)
train_df <- df_cba[train_idx, ]
test_df <- df_cba[-train_idx, ]
train_trans <- methods::as(train_df, "transactions")
itemFrequencyPlot(
train_trans,
topN = 15,
type = "absolute",
main = "Top 15 Most Frequent Items",
col = "cadetblue2"
)
The plot shows that Stray intakes and the Baby age group are very common. A relatively high presence of Other_or_Partner outcomes is also visible, which likely reflects frequent cooperation with rescue partners. It is also worth noting that ‘Intake Condition = Normal’ is one of the most frequent items. This suggests that while we are looking for risk factors like illness, the vast majority of animals arrive healthy, which explains why the Positive and Partner outcomes are so much more common than Negative.
Class association rules are mined using minimum support of 0.01 and minimum confidence of 0.50. These thresholds aim to keep rules meaningful while still allowing patterns for smaller classes to appear. Redundant rules are removed to reduce repetition.
cars <- mineCARs(
outcome_group ~ .,
transactions = train_trans,
support = 0.01,
confidence = 0.50,
maxlen = 6
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE FALSE 5 0.01 1
## maxlen target ext
## 6 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 416
##
## set item appearances ...[90 item(s)] done [0.00s].
## set transactions ...[90 item(s), 41687 transaction(s)] done [0.02s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.06s].
## writing ... [3014 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
cars_clean <- cars[!is.redundant(cars)]
cars_clean <- cars_clean[is.significant(cars_clean, train_trans)]
print(summary(cars_clean))
## set of 1038 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 25 200 401 320 92
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 4.000 4.245 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005205 Min. :0.5000 Min. :0.01000 Min. :1.169
## 1st Qu.:0.008804 1st Qu.:0.5723 1st Qu.:0.01295 1st Qu.:1.463
## Median :0.012330 Median :0.6489 Median :0.01923 Median :1.711
## Mean :0.024092 Mean :0.6585 Mean :0.03802 Mean :1.896
## 3rd Qu.:0.022933 3rd Qu.:0.7368 3rd Qu.:0.03529 3rd Qu.:2.203
## Max. :0.253005 Max. :0.9228 Max. :0.45036 Max. :4.268
## count
## Min. : 217
## 1st Qu.: 367
## Median : 514
## Mean : 1004
## 3rd Qu.: 956
## Max. :10547
##
## mining info:
## data ntransactions support confidence
## transactions 41687 0.01 0.5
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
Rules are sorted by lift to highlight patterns that are stronger than chance. The highest-lift rules often combine intake condition with intake type and age group, which matches expectations for shelter decisions.
inspect(head(sort(cars_clean, by = "lift", decreasing = TRUE), 10))
## lhs rhs support confidence coverage lift count
## [1] {animal_type=Cat,
## is_sterilized=FALSE,
## intake_condition=Ill Severe,
## intake_type=Stray} => {outcome_group=Negative} 0.009523353 0.9084668 0.01048288 4.268146 397
## [2] {animal_type=Cat,
## is_sterilized=FALSE,
## intake_condition=Ill Severe} => {outcome_group=Negative} 0.009883177 0.9035088 0.01093866 4.244852 412
## [3] {animal_type=Cat,
## is_sterilized=FALSE,
## intake_condition=Injured Severe,
## intake_type=Stray} => {outcome_group=Negative} 0.008995610 0.8907363 0.01009907 4.184845 375
## [4] {animal_type=Cat,
## is_sterilized=FALSE,
## intake_condition=Injured Severe} => {outcome_group=Negative} 0.009163528 0.8863109 0.01033895 4.164053 382
## [5] {intake_condition=Ill Severe,
## age_group=Unknown} => {outcome_group=Negative} 0.009787224 0.8644068 0.01132247 4.061143 408
## [6] {sex_base=Unknown,
## intake_condition=Ill Severe} => {outcome_group=Negative} 0.019430518 0.8607864 0.02257298 4.044134 810
## [7] {is_sterilized=Unknown,
## intake_condition=Ill Severe} => {outcome_group=Negative} 0.019430518 0.8607864 0.02257298 4.044134 810
## [8] {animal_type=Cat,
## intake_condition=Ill Severe,
## intake_type=Stray,
## intake_subtype=Field} => {outcome_group=Negative} 0.012881714 0.8430141 0.01528054 3.960637 537
## [9] {is_sterilized=FALSE,
## intake_condition=Ill Severe,
## intake_subtype=Field} => {outcome_group=Negative} 0.010866697 0.8420074 0.01290570 3.955907 453
## [10] {animal_type=Wild,
## intake_condition=Injured Severe} => {outcome_group=Negative} 0.011706287 0.8413793 0.01391321 3.952956 488
The top rules provide clear, actionable insights for shelter triage. For example:
Rule 1:
{Cat, Not Sterilized, Ill Severe, Stray} => {Negative}
Unsterilized stray cats arriving in severe health have a Lift > 4.0 for a negative outcome. This indicates a high-risk population where immediate medical or humane intervention is the standard protocol.
Another interesting pattern is visible in Rule 10:
{Animal Type=Wild, Injured Severe} => {Negative}. This
confirms that the model is correctly identifying that wildlife brought
in with severe injuries are almost universally euthanized, which is
standard practice for animal control agencies.
We can visualize the network of these rules below:
plot(head(cars_clean, 20, by = "lift"), method = "graph", engine = "htmlwidget")
A CBA classifier is trained using the mined-rule approach, producing a decision list. Rules are applied in order until one matches, otherwise, a default class is used.
cba_model <- CBA(
outcome_group ~ .,
data = train_df,
supp = 0.01,
conf = 0.50,
pruning = "M1"
)
print(cba_model)
## CBA Classifier Object
## Formula: outcome_group ~ .
## Number of rules: 432
## Default Class: Negative
## Classification method: first
## Description: CBA algorithm (Liu et al., 1998)
Predictions are generated on the test set and evaluated using a confusion matrix.
pred <- predict(cba_model, test_df)
cm <- confusionMatrix(pred, test_df$outcome_group)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Other_or_Partner Positive
## Negative 1269 500 105
## Other_or_Partner 767 2743 617
## Positive 182 1213 3025
##
## Overall Statistics
##
## Accuracy : 0.6753
## 95% CI : (0.6662, 0.6843)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4925
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Negative Class: Other_or_Partner Class: Positive
## Sensitivity 0.5721 0.6156 0.8073
## Specificity 0.9262 0.7680 0.7910
## Pos Pred Value 0.6772 0.6646 0.6844
## Neg Pred Value 0.8890 0.7278 0.8797
## Prevalence 0.2128 0.4276 0.3596
## Detection Rate 0.1218 0.2632 0.2903
## Detection Prevalence 0.1798 0.3960 0.4241
## Balanced Accuracy 0.7492 0.6918 0.7991
Accuracy and Baseline: The model got an Accuracy of 67.5%. At first glance, that might not seem amazing, but we have to compare it to the baseline. If we just guessed the most common class (Other_or_Partner) for everyone, we would only get 42.8% right. So, the model is actually improving on the baseline by about 25 percentage points, which proves it’s finding real patterns.
Kappa Statistic: The model achieved a Kappa statistic of 0.49. This is a crucial metric because our dataset is imbalanced (with Negative cases being much rarer than Other/Partner). Kappa adjusts for the accuracy that could be achieved simply by guessing based on the frequency of each class. A value of 0.49 places the model in the range of moderate agreement.
McNemar’s Test: The McNemar’s Test P-Value is effectively zero, which statistically confirms that the model’s errors are not symmetric. It is significantly more likely to misclassify a Negative animal as Partner than the other way around.
Default Class: The algorithm chose Negative as the default class. This is interesting as it suggests that Positive and Partner outcomes have very specific rules, but Negative outcomes might be what’s left over when an animal doesn’t fit those nice criteria.
The results show that the fate of an animal is strongly driven by its medical condition and age at the time of arrival. The association rules revealed that severe illness or injury, particularly in stray animals, is the most reliable predictor of a negative outcome. For example, unsterilized stray cats with severe illness form a distinct high-risk group. Additionally, the analysis confirmed that wild animals brought to the shelter almost universally face negative outcomes, likely due to specific safety and health protocols that differ from domestic pets.
Statistical tests also confirmed that the time of year plays a significant role. The seasonal fluctuations in outcomes suggest that external factors, such as the influx of kittens in the summer, affect the shelter’s capacity and the likelihood of adoption for individual animals.
The classification model achieved an accuracy of 67.5%, which is a significant improvement over the random baseline of 42.8%. The model is particularly effective at identifying animals suitable for adoption, showing high sensitivity for the positive class. However, it struggles to distinguish between negative outcomes and transfers to rescue partners. This limitation suggests that the decision to transfer an animal often depends on specific relationships with rescue organizations or available space, which are not captured in the intake data.
In a real-world setting, this model could serve as an early warning system. By automatically flagging animals that match high-risk rules, such as severely ill strays, shelter staff could prioritize these cases for immediate medical review or foster placement, potentially changing the outcome before it becomes inevitable.
Long Beach Animal Shelter Intakes and Outcomes. Long Beach Open Data. https://longbeach.opendatasoft.com/
An AI assistant was consulted for assistance with RMarkdown syntax and error debugging.