1. Introduction

The purpose of this project is to analyse data from the Long Beach Animal Shelter and to identify patterns that are related to different types of animal outcomes. The dataset contains information about animals brought to the shelter, including their type, condition at intake, intake circumstances, and final outcomes.

The main goal of the analysis is to understand how selected characteristics observed at intake are associated with different outcome groups. To achieve this, association rules are used, followed by a classification model based on association rules (CBA). This approach allows both exploration of frequent patterns and evaluation of predictive performance using interpretable rules.

The outcome is analysed as a multi-class variable that groups detailed shelter outcomes into broader and more stable categories.


2. Data and packages

2.1 Packages used

library(tidyverse) library(lubridate) library(janitor) library(stringr)

library(arules) library(arulesCBA) library(arulesViz)

library(caret)

2.2 Data loading and first inspection

df_raw <- read.csv(“animal-shelter-intakes-and-outcomes.csv”, stringsAsFactors = FALSE)

cat(“Number of rows:”, nrow(df_raw), “”) cat(“Number of columns:”, ncol(df_raw), “”)

glimpse(df_raw)

The dataset consists of 52,759 observations and 29 variables. Many variables are stored as text, including dates and logical values. Several columns contain empty strings or the value “NULL”, which requires careful cleaning before any analysis can be performed.


3. Data cleaning and preparation

Association rule methods are highly sensitive to data quality. Small inconsistencies in category names or missing values treated as real categories can strongly affect the results. For this reason, the cleaning process focuses on standardising text variables, handling missing values consistently, and creating interpretable features.

3.1 Helper functions

normalize_text <- function(x) { x <- as.character(x) x <- str_trim(x) x <- str_replace_all(x, “+”, ” “) x[x == “”] <- NA x[toupper(x) == “NULL”] <- NA x <- ifelse(is.na(x), NA, str_to_title(x)) x }

parse_bool <- function(x) { x <- as.character(x) x <- str_trim(x) x <- str_to_lower(x) case_when( x %in% c(“true”, “t”, “1”) ~ TRUE, x %in% c(“false”,“f”,“0”) ~ FALSE, TRUE ~ NA ) }

create_age_group <- function(age_years) { case_when( is.na(age_years) ~ “Unknown”, age_years < 1 ~ “Baby”, age_years < 3 ~ “Young”, age_years < 8 ~ “Adult”, TRUE ~ “Senior” ) }

3.2 Main cleaning steps

df <- df_raw %>% clean_names() %>% mutate(across(where(is.character), normalize_text)) %>% mutate( dob = ymd(dob, quiet = TRUE), intake_date = ymd(intake_date, quiet = TRUE), outcome_date = ymd(outcome_date, quiet = TRUE) ) %>% mutate(intake_condition = ifelse(intake_condition == “Ill Moderatete”, “Ill Moderate”, intake_condition)) %>% mutate( is_current_month = parse_bool(is_current_month), outcome_is_dead = parse_bool(outcome_is_dead), outcome_is_current = parse_bool(outcome_is_current), outcome_is_other = parse_bool(outcome_is_other), outcome_is_alive = parse_bool(outcome_is_alive), intake_is_dead = case_when( intake_is_dead == “Alive On Intake” ~ FALSE, intake_is_dead == “Dead On Intake” ~ TRUE, TRUE ~ NA ) ) %>% mutate( was_outcome_alive = suppressWarnings(as.numeric(was_outcome_alive)), was_outcome_alive = ifelse(is.na(was_outcome_alive), 0, was_outcome_alive), was_outcome_alive = as.logical(was_outcome_alive) ) %>% mutate( age_at_intake_years = as.numeric(intake_date - dob) / 365.25, age_group = create_age_group(age_at_intake_years) ) %>% mutate( age_at_intake_years = ifelse(!is.na(age_at_intake_years) & (age_at_intake_years < 0 | age_at_intake_years > 40), NA, age_at_intake_years), age_group = ifelse(is.na(age_at_intake_years), “Unknown”, age_group) ) %>% mutate( intake_duration = suppressWarnings(as.numeric(intake_duration)), intake_duration = ifelse(!is.na(intake_duration) & intake_duration < 0, NA, intake_duration) ) %>% mutate( sex_base = case_when( sex == “Neutered” ~ “Male”, sex == “Spayed” ~ “Female”, sex %in% c(“Male”,“Female”,“Unknown”) ~ sex, TRUE ~ “Unknown” ), is_sterilized = case_when( sex %in% c(“Neutered”,“Spayed”) ~ TRUE, sex %in% c(“Male”,“Female”) ~ FALSE, TRUE ~ NA ) )

The cleaning process removes invalid values, standardises categories, and creates new features that better reflect meaningful characteristics of animals at intake. Age is calculated using dates and then grouped into simple categories. Sex information is split into biological sex and sterilisation status.


4. Outcome grouping

The original outcome variable contains many detailed categories, some of which appear very rarely. For modelling purposes, outcomes are grouped into broader categories that are easier to interpret and more stable for rule learning.

positive <- c(“Adoption”, “Return To Owner”, “Community Cat”, “Return To Wild Habitat”, “Homefirst”, “Foster To Adopt”) negative <- c(“Euthanasia”, “Died”, “Disposal”) partner <- c(“Transfer”, “Rescue”, “Transport”, “Shelter, Neuter, Return”) admin <- c(“Missing”, “Duplicate”)

df <- df %>% mutate( outcome_group = case_when( isTRUE(outcome_is_current) | is.na(outcome_type) ~ “No_Outcome_Yet”, outcome_type %in% positive ~ “Positive”, outcome_type %in% negative ~ “Negative”, outcome_type %in% partner ~ “Other_or_Partner”, outcome_type %in% admin ~ “Admin_or_Unknown”, TRUE ~ “Admin_or_Unknown” ) )

table(df$outcome_group)

The resulting distribution shows that most animals fall into the Positive or Other_or_Partner categories, while No_Outcome_Yet and Admin_or_Unknown are very rare.


5. Dataset for association rules and CBA

Only variables available at intake are used as predictors. Variables with very high cardinality, such as exact addresses, are excluded because they generate many rare categories and reduce the quality of association rules.

df_cba <- df %>% transmute( outcome_group = as.factor(outcome_group), animal_type = as.factor(ifelse(is.na(animal_type), “Unknown”, animal_type)), sex_base = as.factor(ifelse(is.na(sex_base), “Unknown”, sex_base)), is_sterilized = as.factor(ifelse(is.na(is_sterilized), “Unknown”, as.character(is_sterilized))), intake_condition = as.factor(ifelse(is.na(intake_condition), “Unknown”, intake_condition)), intake_type = as.factor(ifelse(is.na(intake_type), “Unknown”, intake_type)), intake_subtype = as.factor(ifelse(is.na(intake_subtype), “Unknown”, intake_subtype)), jurisdiction = as.factor(ifelse(is.na(jurisdiction), “Unknown”, jurisdiction)), age_at_intake_years = age_at_intake_years, intake_duration = intake_duration ) %>% filter(!is.na(outcome_group))


6. Train and test split

train_idx <- createDataPartition(df_cba$outcome_group, p = 0.8, list = FALSE) train_df <- df_cba[train_idx, ] test_df <- df_cba[-train_idx, ]

prop.table(table(train_df\(outcome_group)) prop.table(table(test_df\)outcome_group))

The split preserves class proportions, which is important given the strong imbalance between outcome groups.


7. Association rules and CBA

7.1 Supervised discretisation and rule mining

train_disc <- discretizeDF.supervised(outcome_group ~ ., data = train_df, method = “mdlp”)

train_trans <- transactions(train_disc) train_trans <- train_trans[, itemFrequency(train_trans) > 0.01]

cars <- mineCARs( outcome_group ~ ., transactions = train_trans, support = 0.02, confidence = 0.6 )

cars <- cars[!is.redundant(cars)] cars <- cars[is.significant(cars, train_trans)]

summary(cars)

The strongest rules are related to severe intake conditions and very short intake durations, which are strongly associated with negative outcomes. High lift values indicate that these rules substantially improve prediction compared to baseline class frequencies.

7.2 CBA classifier

cba_model <- CBA( outcome_group ~ ., data = train_df, supp = 0.02, conf = 0.6, pruning = “M1” )

pred <- predict(cba_model, test_df) cm <- confusionMatrix(pred, test_df$outcome_group) cm


8. Results and interpretation

The classifier achieved an accuracy of approximately 0.68 on the test set, which is substantially higher than the baseline accuracy obtained by always predicting the most frequent class. The Kappa statistic is around 0.51, indicating moderate agreement beyond chance.

The model performs well for the most frequent outcome groups (Positive, Other_or_Partner, Negative). The two rare classes are not predicted by the model. This result is expected in association rule-based classification when minimum support thresholds are applied, as rare classes do not generate enough frequent patterns.


9. Conclusions

The analysis shows that association rules are effective in identifying clear and interpretable relationships between intake characteristics and shelter outcomes. Severe medical conditions and immediate processing are strongly linked to negative outcomes, while healthier animals are more likely to have positive or partner-related outcomes.

The CBA classifier provides reasonable predictive performance while remaining transparent and easy to interpret. Its limitations for very rare outcome groups highlight an important trade-off between rule strength and coverage, which is inherent to association rule-based methods.

Overall, the project demonstrates a complete workflow including data cleaning, feature engineering, association rule mining, and supervised classification, with results that are consistent with both the data structure and domain intuition.


10. AI usage statement

AI tools were used to assist with debugging R code and improving clarity of explanations. All modelling decisions, interpretations, and final conclusions are based on the author’s understanding of the methods and the observed results from the analysis.