1. Introduction
The purpose of this project is to analyse data from the Long Beach
Animal Shelter and to identify patterns that are related to different
types of animal outcomes. The dataset contains information about animals
brought to the shelter, including their type, condition at intake,
intake circumstances, and final outcomes.
The main goal of the analysis is to understand how selected
characteristics observed at intake are associated with different outcome
groups. To achieve this, association rules are used, followed by a
classification model based on association rules (CBA). This approach
allows both exploration of frequent patterns and evaluation of
predictive performance using interpretable rules.
The outcome is analysed as a multi-class variable that groups
detailed shelter outcomes into broader and more stable categories.
2. Data and packages
2.1 Packages used
library(tidyverse) library(lubridate) library(janitor)
library(stringr)
library(arules) library(arulesCBA) library(arulesViz)
library(caret)
2.2 Data loading and first inspection
df_raw <- read.csv(“animal-shelter-intakes-and-outcomes.csv”,
stringsAsFactors = FALSE)
cat(“Number of rows:”, nrow(df_raw), “”) cat(“Number of columns:”,
ncol(df_raw), “”)
glimpse(df_raw)
The dataset consists of 52,759 observations and 29 variables. Many
variables are stored as text, including dates and logical values.
Several columns contain empty strings or the value “NULL”, which
requires careful cleaning before any analysis can be performed.
3. Data cleaning and preparation
Association rule methods are highly sensitive to data quality. Small
inconsistencies in category names or missing values treated as real
categories can strongly affect the results. For this reason, the
cleaning process focuses on standardising text variables, handling
missing values consistently, and creating interpretable features.
3.1 Helper functions
normalize_text <- function(x) { x <- as.character(x) x <-
str_trim(x) x <- str_replace_all(x, “+”, ” “) x[x == “”] <- NA
x[toupper(x) == “NULL”] <- NA x <- ifelse(is.na(x), NA,
str_to_title(x)) x }
parse_bool <- function(x) { x <- as.character(x) x <-
str_trim(x) x <- str_to_lower(x) case_when( x %in% c(“true”, “t”,
“1”) ~ TRUE, x %in% c(“false”,“f”,“0”) ~ FALSE, TRUE ~ NA ) }
create_age_group <- function(age_years) { case_when(
is.na(age_years) ~ “Unknown”, age_years < 1 ~ “Baby”, age_years <
3 ~ “Young”, age_years < 8 ~ “Adult”, TRUE ~ “Senior” ) }
3.2 Main cleaning steps
df <- df_raw %>% clean_names() %>%
mutate(across(where(is.character), normalize_text)) %>% mutate( dob =
ymd(dob, quiet = TRUE), intake_date = ymd(intake_date, quiet = TRUE),
outcome_date = ymd(outcome_date, quiet = TRUE) ) %>%
mutate(intake_condition = ifelse(intake_condition == “Ill Moderatete”,
“Ill Moderate”, intake_condition)) %>% mutate( is_current_month =
parse_bool(is_current_month), outcome_is_dead =
parse_bool(outcome_is_dead), outcome_is_current =
parse_bool(outcome_is_current), outcome_is_other =
parse_bool(outcome_is_other), outcome_is_alive =
parse_bool(outcome_is_alive), intake_is_dead = case_when( intake_is_dead
== “Alive On Intake” ~ FALSE, intake_is_dead == “Dead On Intake” ~ TRUE,
TRUE ~ NA ) ) %>% mutate( was_outcome_alive =
suppressWarnings(as.numeric(was_outcome_alive)), was_outcome_alive =
ifelse(is.na(was_outcome_alive), 0, was_outcome_alive),
was_outcome_alive = as.logical(was_outcome_alive) ) %>% mutate(
age_at_intake_years = as.numeric(intake_date - dob) / 365.25, age_group
= create_age_group(age_at_intake_years) ) %>% mutate(
age_at_intake_years = ifelse(!is.na(age_at_intake_years) &
(age_at_intake_years < 0 | age_at_intake_years > 40), NA,
age_at_intake_years), age_group = ifelse(is.na(age_at_intake_years),
“Unknown”, age_group) ) %>% mutate( intake_duration =
suppressWarnings(as.numeric(intake_duration)), intake_duration =
ifelse(!is.na(intake_duration) & intake_duration < 0, NA,
intake_duration) ) %>% mutate( sex_base = case_when( sex ==
“Neutered” ~ “Male”, sex == “Spayed” ~ “Female”, sex %in%
c(“Male”,“Female”,“Unknown”) ~ sex, TRUE ~ “Unknown” ), is_sterilized =
case_when( sex %in% c(“Neutered”,“Spayed”) ~ TRUE, sex %in%
c(“Male”,“Female”) ~ FALSE, TRUE ~ NA ) )
The cleaning process removes invalid values, standardises categories,
and creates new features that better reflect meaningful characteristics
of animals at intake. Age is calculated using dates and then grouped
into simple categories. Sex information is split into biological sex and
sterilisation status.
4. Outcome grouping
The original outcome variable contains many detailed categories, some
of which appear very rarely. For modelling purposes, outcomes are
grouped into broader categories that are easier to interpret and more
stable for rule learning.
positive <- c(“Adoption”, “Return To Owner”, “Community Cat”,
“Return To Wild Habitat”, “Homefirst”, “Foster To Adopt”) negative <-
c(“Euthanasia”, “Died”, “Disposal”) partner <- c(“Transfer”,
“Rescue”, “Transport”, “Shelter, Neuter, Return”) admin <-
c(“Missing”, “Duplicate”)
df <- df %>% mutate( outcome_group = case_when(
isTRUE(outcome_is_current) | is.na(outcome_type) ~ “No_Outcome_Yet”,
outcome_type %in% positive ~ “Positive”, outcome_type %in% negative ~
“Negative”, outcome_type %in% partner ~ “Other_or_Partner”, outcome_type
%in% admin ~ “Admin_or_Unknown”, TRUE ~ “Admin_or_Unknown” ) )
table(df$outcome_group)
The resulting distribution shows that most animals fall into the
Positive or Other_or_Partner categories, while No_Outcome_Yet and
Admin_or_Unknown are very rare.
5. Dataset for association rules and CBA
Only variables available at intake are used as predictors. Variables
with very high cardinality, such as exact addresses, are excluded
because they generate many rare categories and reduce the quality of
association rules.
df_cba <- df %>% transmute( outcome_group =
as.factor(outcome_group), animal_type =
as.factor(ifelse(is.na(animal_type), “Unknown”, animal_type)), sex_base
= as.factor(ifelse(is.na(sex_base), “Unknown”, sex_base)), is_sterilized
= as.factor(ifelse(is.na(is_sterilized), “Unknown”,
as.character(is_sterilized))), intake_condition =
as.factor(ifelse(is.na(intake_condition), “Unknown”, intake_condition)),
intake_type = as.factor(ifelse(is.na(intake_type), “Unknown”,
intake_type)), intake_subtype = as.factor(ifelse(is.na(intake_subtype),
“Unknown”, intake_subtype)), jurisdiction =
as.factor(ifelse(is.na(jurisdiction), “Unknown”, jurisdiction)),
age_at_intake_years = age_at_intake_years, intake_duration =
intake_duration ) %>% filter(!is.na(outcome_group))
6. Train and test split
train_idx <- createDataPartition(df_cba$outcome_group, p = 0.8,
list = FALSE) train_df <- df_cba[train_idx, ] test_df <-
df_cba[-train_idx, ]
prop.table(table(train_df\(outcome_group))
prop.table(table(test_df\)outcome_group))
The split preserves class proportions, which is important given the
strong imbalance between outcome groups.
7. Association rules and CBA
7.1 Supervised discretisation and rule mining
train_disc <- discretizeDF.supervised(outcome_group ~ ., data =
train_df, method = “mdlp”)
train_trans <- transactions(train_disc) train_trans <-
train_trans[, itemFrequency(train_trans) > 0.01]
cars <- mineCARs( outcome_group ~ ., transactions = train_trans,
support = 0.02, confidence = 0.6 )
cars <- cars[!is.redundant(cars)] cars <-
cars[is.significant(cars, train_trans)]
summary(cars)
The strongest rules are related to severe intake conditions and very
short intake durations, which are strongly associated with negative
outcomes. High lift values indicate that these rules substantially
improve prediction compared to baseline class frequencies.
7.2 CBA classifier
cba_model <- CBA( outcome_group ~ ., data = train_df, supp = 0.02,
conf = 0.6, pruning = “M1” )
pred <- predict(cba_model, test_df) cm <- confusionMatrix(pred,
test_df$outcome_group) cm
8. Results and interpretation
The classifier achieved an accuracy of approximately 0.68 on the test
set, which is substantially higher than the baseline accuracy obtained
by always predicting the most frequent class. The Kappa statistic is
around 0.51, indicating moderate agreement beyond chance.
The model performs well for the most frequent outcome groups
(Positive, Other_or_Partner, Negative). The two rare classes are not
predicted by the model. This result is expected in association
rule-based classification when minimum support thresholds are applied,
as rare classes do not generate enough frequent patterns.
9. Conclusions
The analysis shows that association rules are effective in
identifying clear and interpretable relationships between intake
characteristics and shelter outcomes. Severe medical conditions and
immediate processing are strongly linked to negative outcomes, while
healthier animals are more likely to have positive or partner-related
outcomes.
The CBA classifier provides reasonable predictive performance while
remaining transparent and easy to interpret. Its limitations for very
rare outcome groups highlight an important trade-off between rule
strength and coverage, which is inherent to association rule-based
methods.
Overall, the project demonstrates a complete workflow including data
cleaning, feature engineering, association rule mining, and supervised
classification, with results that are consistent with both the data
structure and domain intuition.
10. AI usage statement
AI tools were used to assist with debugging R code and improving
clarity of explanations. All modelling decisions, interpretations, and
final conclusions are based on the author’s understanding of the methods
and the observed results from the analysis.