Data Source: the PetFinder Adoption Prediction competition on Kaggle, 2019: https://www.kaggle.com/c/petfinder-adoption-prediction/data
PetFinder wanted to predict which of its incoming rescued cats and dogs would be adopted fastest. That would help it forecast occupancy at its shelters, and could also reveal which features were “working” in the sense of speeding up adoptions. I’m not going to chase those “craftable” features here, since that would involve deep learning on the pets’ textual descriptions and photos. Instead I’ll focus on the static features of the pets, such as fur length and breed. Some of these, like vaccination status, sterilization status, and adoption fee, are still things PetFinder might want to consider changing for certain animals, if the speedup in adoption time were worth the associated costs.
pets <- read_csv('pets.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Name = col_character(),
## RescuerID = col_character(),
## Description = col_character(),
## PetID = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
head(pets)
## # A tibble: 6 x 24
## Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Nibble 3 299 0 1 1 7 0 1
## 2 2 No Name Yet 1 265 0 1 1 2 0 2
## 3 1 Brisco 1 307 0 1 2 7 0 2
## 4 1 Miko 4 307 0 2 1 2 0 2
## 5 1 Hunter 1 307 0 1 1 0 0 2
## 6 2 <NA> 3 266 0 2 5 6 0 2
## # … with 14 more variables: FurLength <dbl>, Vaccinated <dbl>, Dewormed <dbl>,
## # Sterilized <dbl>, Health <dbl>, Quantity <dbl>, Fee <dbl>, State <dbl>,
## # RescuerID <chr>, VideoAmt <dbl>, Description <chr>, PetID <chr>,
## # PhotoAmt <dbl>, AdoptionSpeed <dbl>
The response variable here is AdoptionSpeed, an ordinal categorical variable representing how long it took for the pet to get adopted:
- 0 - Pet was adopted on the same day as it was listed.
- 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
- 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
- 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
- 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days.)
About half of the speeds were 0-2 and half were 3-4, so I’ll attempt to predict whether a pet will be adopted within a month or not, using the features listed above. The actual Kaggle competition involved sentiment analysis of the last text column, “Description”, but here let’s just turn the description into a simpler feature: how many characters long it is.
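As a quick sanity check on that split (output not reproduced here; the skim() summary further down shows roughly 7.5k pets on each side after binarizing), you can count the raw AdoptionSpeed values:

# sanity check on class balance before binarizing: 0-2 vs 3-4 should be roughly even
pets %>%
  count(AdoptionSpeed)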
Most of those features are categorical, rather than numerical or text. Although many of them have a natural ordering, like “FurLength” (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified), it will be more useful to keep “Not Specified” as its own category and abandon the ordering of the other three, rather than treat them as ordered and be unable to use the “Not Specified” rows.
# convert 'Description' to 'TextLength' and 'Name' to binary 'Named' (yes/no)
numericals <- c('Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt', 'TextLength')
# convert 'AdoptionSpeed' into binary 'AdoptedFast' (yes/no)
target <- 'AdoptedFast'
categoricals <- c('Type', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Named',
'Dewormed', 'Sterilized', 'Health', 'State')
ids <- 'PetID'
First we need to create the three promised features: one indicating whether the pet is named or not, one giving the length of the text description, and one converting the target variable to indicate whether the pet was adopted within a month.
pets$Name[1:30]
## [1] "Nibble" "No Name Yet"
## [3] "Brisco" "Miko"
## [5] "Hunter" NA
## [7] "BULAT" "Siu Pak & Her 6 Puppies"
## [9] NA "Kitty"
## [11] "Bear" "Kali"
## [13] "Peanut" "2 Mths Old Cute Kitties"
## [15] "Lost Dog" "Max"
## [17] "Brownie" "Blackie"
## [19] "Beauty" NA
## [21] "Godiva" "Tigers"
## [23] "Kenit, Kenot, Techit, Keyad, Owen" "Donut"
## [25] "Cikenet" "Garfield"
## [27] "No Name" "No Name"
## [29] "Hunter" "Pepper"
Besides the NA’s, there are some “No Name” and “No Name Yet” listings, as well as some descriptive statements instead of names (“2 Mths Old Cute Kitties”, “Lost Dog”). Maybe the best thing to do is just take care of the NA’s and “No Name” type listings.
pets <- pets %>%
mutate(AdoptedFast = ifelse(AdoptionSpeed < 3, 'yes', 'no')) %>%
mutate(TextLength = str_length(Description)) %>%
mutate(Named = ifelse(is.na(Name), 'no',
ifelse(str_starts(Name, "No "), 'no', 'yes'))) %>%
select(-c('AdoptionSpeed', 'Description', 'Name', 'RescuerID'))
# removed rescuer ID in order to train on 90% fewer features
head(pets[rev(names(pets))])
## # A tibble: 6 x 23
## Named TextLength AdoptedFast PhotoAmt PetID VideoAmt State Fee Quantity
## <chr> <int> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 yes 359 yes 1 86e1089a3 0 41326 100 1
## 2 no 118 yes 2 6296e909a 0 41401 0 1
## 3 yes 393 no 7 3422e4906 0 41326 0 1
## 4 yes 146 yes 8 5842f1ff5 0 41401 150 1
## 5 yes 390 yes 3 850a43f90 0 41326 0 1
## 6 no 87 yes 2 d24c30b4b 0 41326 0 1
## # … with 14 more variables: Health <dbl>, Sterilized <dbl>, Dewormed <dbl>,
## # Vaccinated <dbl>, FurLength <dbl>, MaturitySize <dbl>, Color3 <dbl>,
## # Color2 <dbl>, Color1 <dbl>, Gender <dbl>, Breed2 <dbl>, Breed1 <dbl>,
## # Age <dbl>, Type <dbl>
Make all the categorical features, and the target, into factors.
pets[, c(categoricals, target)] <- lapply(pets[, c(categoricals, target)], factor)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
## ✓ broom 0.7.3 ✓ recipes 0.1.15
## ✓ dials 0.0.9 ✓ rsample 0.0.9
## ✓ infer 0.5.4 ✓ tune 0.1.3
## ✓ modeldata 0.1.0 ✓ workflows 0.2.2
## ✓ parsnip 0.1.5 ✓ yardstick 0.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x dplyr::collapse() masks glue::collapse()
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(skimr)
skim(pets, all_of(c(categoricals, target)))
| Data summary | |
|---|---|
| Name | pets |
| Number of rows | 14993 |
| Number of columns | 23 |
| Column type frequency: factor | 16 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Type | 0 | 1 | FALSE | 2 | 1: 8132, 2: 6861 |
| Breed1 | 0 | 1 | FALSE | 176 | 307: 5927, 266: 3634, 265: 1258, 299: 342 |
| Breed2 | 0 | 1 | FALSE | 135 | 0: 10762, 307: 1727, 266: 599, 265: 321 |
| Gender | 0 | 1 | FALSE | 3 | 2: 7277, 1: 5536, 3: 2180 |
| Color1 | 0 | 1 | FALSE | 7 | 1: 7427, 2: 3750, 3: 947, 5: 884 |
| Color2 | 0 | 1 | FALSE | 7 | 0: 4471, 7: 3438, 2: 3313, 5: 1128 |
| Color3 | 0 | 1 | FALSE | 6 | 0: 10604, 7: 3221, 5: 417, 6: 378 |
| MaturitySize | 0 | 1 | FALSE | 4 | 2: 10305, 1: 3395, 3: 1260, 4: 33 |
| FurLength | 0 | 1 | FALSE | 3 | 1: 8808, 2: 5361, 3: 824 |
| Vaccinated | 0 | 1 | FALSE | 3 | 2: 7227, 1: 5898, 3: 1868 |
| Named | 0 | 1 | FALSE | 2 | yes: 13644, no: 1349 |
| Dewormed | 0 | 1 | FALSE | 3 | 1: 8397, 2: 4815, 3: 1781 |
| Sterilized | 0 | 1 | FALSE | 3 | 2: 10077, 1: 3101, 3: 1815 |
| Health | 0 | 1 | FALSE | 3 | 1: 14478, 2: 481, 3: 34 |
| State | 0 | 1 | FALSE | 14 | 413: 8714, 414: 3845, 413: 843, 413: 507 |
| AdoptedFast | 0 | 1 | FALSE | 2 | yes: 7537, no: 7456 |
Before scaling the numeric features, the data should be split into training and testing groups. That way, when models are trained, they won’t be using any information from the test data.
# set the random seed, for reproducibility, and split the data 80/20.
set.seed(607)
traintest <- initial_split(pets, prop = .80)
train_data <- training(traintest)
test_data <- testing(traintest)
Now the numerical data can be scaled, using a recipe. Some models, like decision trees, don’t need scaled inputs, but most others perform better with them. In the same recipe, you can one-hot encode the categorical variables, so that models can train on 1’s and 0’s, as they like to do; this turns one column of many values into many columns of binary values. Since the breed features, for example, include over 300 breeds, which now become over 300 columns, some of those columns may occur only once in the training set, or not at all there and only once in the testing set, which makes for bad predictions and erroneous learning. So step_zv (for “zero variance”) removes these problematic columns. You can also give the PetID column an identifier role, so that models ignore it but it stays around for bookkeeping, if needed.
rec <- recipe(AdoptedFast ~ ., data = train_data) %>%
update_role(all_of(ids), new_role = "ID") %>%
step_normalize(all_of(numericals)) %>%
step_dummy(all_of(categoricals)) %>%
step_zv(all_predictors())
# prep() fits the scaler to the training data
scaler <- prep(rec, training = train_data)
# and bake() transforms all data using the statistics learned by prep()
scaled_train <- bake(scaler, train_data)
scaled_test <- bake(scaler, test_data)
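As an optional check (not part of the original pipeline), the prepped recipe can report which columns step_zv() actually dropped; since step_zv() is the third step in the recipe above, it is number = 3 in tidy(). The dim() call just shows how wide the processed data ends up.

# which zero-variance columns were removed by step 3, and how wide is the result?
tidy(scaler, number = 3)   # tibble listing the removed terms
dim(scaled_train)          # rows and columns after normalizing, dummy coding, and zv removal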
Now the scaled data can be used to train models. Or at least, that shows conceptually how the scaler is fit to the data and then transforms it. But instead of exiting the pipeline so soon, a recipe can sit inside a larger pipeline called a workflow(), which manages all the preprocessing steps as part of model fitting and prediction. It just needs the data, the recipe, and a model.
One natural model choice to start with might be a logistic regression classifier, such as logistic_reg() from the parsnip package.
lr_mod <- logistic_reg() %>%
set_engine('glm') # barebones log_reg, without regularization penalties
pet_workflow <- workflow() %>%
add_model(lr_mod) %>%
add_recipe(rec)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Fit the model to the training set.
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 339 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.00695 1.62 -0.00428 9.97e- 1
## 2 Age -0.304 0.0298 -10.2 1.91e-24
## 3 Quantity -0.0902 0.0279 -3.23 1.24e- 3
## 4 Fee -0.110 0.0242 -4.53 5.87e- 6
## 5 VideoAmt -0.00518 0.0207 -0.251 8.02e- 1
## 6 PhotoAmt 0.00310 0.0215 0.144 8.85e- 1
## 7 TextLength 0.0326 0.0213 1.53 1.26e- 1
## 8 Type_X2 -1.90 1.26 -1.50 1.34e- 1
## 9 Breed1_X1 -16.2 879. -0.0185 9.85e- 1
## 10 Breed1_X3 -17.4 1455. -0.0120 9.90e- 1
## # … with 329 more rows
We can ask the logistic regression model for predicted probabilities, rather than just binary ‘yes’/‘no’ predictions; that way we can use the area under the ROC curve to evaluate the model over all choices of threshold, rather than just 0.50.
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# bound the true values for visual inspection of predictions
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.493 0.507 yes
## 2 0.440 0.560 no
## 3 0.556 0.444 no
## 4 0.698 0.302 no
## 5 0.659 0.341 no
## 6 0.827 0.173 no
## 7 0.494 0.506 yes
## 8 0.689 0.311 yes
## 9 0.179 0.821 no
## 10 0.460 0.540 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.658
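For comparison with that threshold-free view, here is what the same model looks like at just the default 0.50 cutoff. This is a supplementary check rather than part of the original analysis, using hard class predictions with yardstick's accuracy() and conf_mat().

# hard 'yes'/'no' predictions at the 0.50 cutoff, scored with accuracy and a confusion matrix
class_pred <- predict(pet_fit, test_data) %>%
  bind_cols(test_data %>% select(AdoptedFast))
accuracy(class_pred, truth = AdoptedFast, estimate = .pred_class)
conf_mat(class_pred, truth = AdoptedFast, estimate = .pred_class)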
Not a great area under the curve (0.658), but better than guessing (0.5). Adding some regularization penalties to the parameters would probably help. For this, you can use the same model, but run it with an ElasticNet engine (‘glmnet’ instead of the ‘glm’ used above). ElasticNet lets you specify what proportion of L2/L1 penalties to apply to the weights learned by the model. L2 penalizes large weights more heavily, while L1 penalizes small weights relatively more, tending to push them all the way to zero.
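Concretely, the penalty glmnet adds to the loss looks like the sketch below (parsnip's penalty maps to glmnet's lambda and mixture to its alpha; the function here is purely illustrative, not something the pipeline needs).

# illustrative only: the elastic-net penalty added to the negative log-likelihood
elastic_penalty <- function(beta, penalty = 0.01, mixture = 0.2) {
  penalty * (mixture * sum(abs(beta)) +           # L1 part: hits small weights relatively hard, zeroing some out
             (1 - mixture) / 2 * sum(beta ^ 2))   # L2 part: hits large weights hardest
}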
Because we built this into a modular workflow(), we can just change the model part of things and run the pipeline easily.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-1
# instantiate a model with 20% L1 and 80% L2, and using a penalty of 0.01 (somewhat arbitrarily)
lr_elastic <- logistic_reg(penalty = 0.01, mixture = 0.2) %>%
set_engine('glmnet')
pet_workflow <- pet_workflow %>%
update_model(lr_elastic)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 0.2
##
## Computational engine: glmnet
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 339 x 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) -0.408 0.01
## 2 Age -0.230 0.01
## 3 Quantity -0.0813 0.01
## 4 Fee -0.0626 0.01
## 5 VideoAmt 0 0.01
## 6 PhotoAmt 0 0.01
## 7 TextLength 0.0205 0.01
## 8 Type_X2 -0.0283 0.01
## 9 Breed1_X1 -1.38 0.01
## 10 Breed1_X3 -1.80 0.01
## # … with 329 more rows
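A quick supplementary check of how aggressive the shrinkage was: count how many coefficients the elastic net set exactly to zero (VideoAmt and PhotoAmt above are two of them).

# how many of the terms were shrunk to exactly zero by the L1 part of the penalty?
pet_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  summarise(n_terms = n(), n_zeroed = sum(estimate == 0))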
The learned weights (the estimate column above) are smaller than in the unpenalized model, with some shrunk all the way to zero, though the intercept is larger. This model should overfit less, since its parameters are smaller, but it will be more biased. Let’s compare how it predicts versus the first one.
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.496 0.504 yes
## 2 0.475 0.525 no
## 3 0.552 0.448 no
## 4 0.680 0.320 no
## 5 0.632 0.368 no
## 6 0.791 0.209 no
## 7 0.497 0.503 yes
## 8 0.591 0.409 yes
## 9 0.239 0.761 no
## 10 0.466 0.534 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.663
That’s barely any better than the first model. One other ML model that usually gets thrown at this type of data and task is a random forest. It handles categorical data well, and it deals with overfitting by averaging many trees, each of which only sees a random subset of the features at any given split, so the model can’t rely too heavily on any one feature. Again we can just substitute it into the workflow and give it a whirl.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_mod <- rand_forest(mode = 'classification', mtry = 20,
min_n = 3, trees = 100) %>%
set_engine('randomForest')
pet_workflow <- pet_workflow %>%
update_model(rf_mod)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 20
## trees = 100
## min_n = 3
##
## Computational engine: randomForest
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.48 0.52 yes
## 2 0.46 0.54 no
## 3 0.570 0.43 no
## 4 0.46 0.54 no
## 5 0.67 0.33 no
## 6 0.92 0.08 no
## 7 0.43 0.570 yes
## 8 0.46 0.54 yes
## 9 0.38 0.62 no
## 10 0.34 0.66 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
That curve looks better. The logistic regression models had ROC-AUC scores of about 0.66, so let’s see how much better the random forest did.
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.707
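As a side note on the earlier point about the forest not relying too heavily on any one feature, the underlying randomForest object exposes a Gini-based importance measure. This is a supplementary peek (rf_obj is just a throwaway name here); without importance = TRUE passed through set_engine(), only MeanDecreaseGini is available.

# top 10 features by mean decrease in Gini impurity
rf_obj <- pull_workflow_fit(pet_fit)$fit
imp <- rf_obj$importance
head(imp[order(-imp[, "MeanDecreaseGini"]), , drop = FALSE], 10)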
An AUC of 0.707 is better, but still not great. The proper thing to do now is to find better hyperparameters for the random forest model. I just guessed at mtry = 20, which sets how many features are randomly sampled as split candidates at each node, and min_n = 3, which stops splitting a node once it holds fewer than 3 examples. And to make the model train faster, I only used trees = 100; predictive accuracy would likely improve with 500 or 1000 trees.
rf2_mod <- rand_forest(mode = 'classification', mtry = 30,
min_n = 4, trees = 200) %>%
set_engine('randomForest')
pet_workflow <- pet_workflow %>%
update_model(rf2_mod)
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.495 0.505 yes
## 2 0.425 0.575 no
## 3 0.575 0.425 no
## 4 0.36 0.64 no
## 5 0.575 0.425 no
## 6 0.835 0.165 no
## 7 0.4 0.6 yes
## 8 0.36 0.64 yes
## 9 0.335 0.665 no
## 10 0.355 0.645 yes
## # … with 2,988 more rows
The notable thing about those predictions is that they are more “certain” of themselves, i.e. they generally sit further away from 0.50.
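One rough way to quantify that (again, just a supplementary summary, not part of the original post) is the average distance of the predicted probabilities from the 0.50 boundary:

# average distance of predictions from the 0.50 decision boundary
pet_pred %>%
  summarise(mean_dist_from_half = mean(abs(.pred_yes - 0.5)))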
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.717
Some ways to improve from here:

- Use cross-fold validation with a hyperparameter grid search to find the best combination of hyperparameters, if the goal is to squeeze better predictions out of the model (a sketch follows this list).
- Try different ML models.
- Use NLP to analyze the text descriptions of the pets, rather than just using the length of the text as a (pretty useless) feature.
- Use the photos of the pets (not included here, but available on Kaggle) as extra information for a neural network.
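A minimal sketch of what that first item could look like with tune(), vfold_cv(), and tune_grid(); the grid ranges, fold count, and tree count below are illustrative guesses, not tuned values from this post.

# mark mtry and min_n as tunable; use a fixed, larger number of trees
rf_tune <- rand_forest(mode = 'classification', mtry = tune(),
                       min_n = tune(), trees = 500) %>%
  set_engine('randomForest')
tune_workflow <- pet_workflow %>%
  update_model(rf_tune)
# 5-fold cross-validation on the training data, and a small regular grid of candidates
folds <- vfold_cv(train_data, v = 5)
rf_grid <- grid_regular(mtry(range = c(10, 60)),
                        min_n(range = c(2, 10)),
                        levels = 3)
tune_results <- tune_grid(tune_workflow, resamples = folds, grid = rf_grid,
                          metrics = metric_set(roc_auc))
show_best(tune_results, metric = "roc_auc")
# refit on all the training data with the best combination, then score the test set
best_params <- select_best(tune_results, metric = "roc_auc")
final_fit <- finalize_workflow(tune_workflow, best_params) %>%
  fit(data = train_data)
predict(final_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(AdoptedFast)) %>%
  roc_auc(truth = AdoptedFast, .pred_no)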
- The data on Kaggle: https://www.kaggle.com/c/petfinder-adoption-prediction/data
- A blog post by Paul Sharkey that helped me
- A helpful tidymodels.org page