Data Source: the PetFinder Adoption Prediction competition on Kaggle, 2019: https://www.kaggle.com/c/petfinder-adoption-prediction/data
PetFinder wanted to predict which of its incoming rescued cats and dogs would be adopted fastest. That would help it forecast occupancy at its shelters, and could also reveal which features were “working” in the sense of speeding up adoptions. I’m not going to chase those “craftable” features here, since that would involve deep learning on the pets’ textual descriptions and photos. Instead I’ll focus on the static features of the pets, such as fur length and breed. Some of these, like vaccination status, sterilization status, and adoption fee, are still things PetFinder might want to consider changing for certain animals, if the speedup in adoption time were worth the associated costs.
pets <- read_csv('pets.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Name = col_character(),
## RescuerID = col_character(),
## Description = col_character(),
## PetID = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
head(pets)
## # A tibble: 6 x 24
## Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Nibble 3 299 0 1 1 7 0 1
## 2 2 No Name Yet 1 265 0 1 1 2 0 2
## 3 1 Brisco 1 307 0 1 2 7 0 2
## 4 1 Miko 4 307 0 2 1 2 0 2
## 5 1 Hunter 1 307 0 1 1 0 0 2
## 6 2 <NA> 3 266 0 2 5 6 0 2
## # … with 14 more variables: FurLength <dbl>, Vaccinated <dbl>, Dewormed <dbl>,
## # Sterilized <dbl>, Health <dbl>, Quantity <dbl>, Fee <dbl>, State <dbl>,
## # RescuerID <chr>, VideoAmt <dbl>, Description <chr>, PetID <chr>,
## # PhotoAmt <dbl>, AdoptionSpeed <dbl>
The response variable here is AdoptionSpeed, an ordinal categorical variable representing how long it took for the pet to get adopted:
- 0 - Pet was adopted on the same day as it was listed.
- 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
- 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
- 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
- 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days.)
About half of the speeds were 0-2 and half were 3-4, so I’ll attempt to predict whether a pet will be adopted within a month or not, using the features listed above. The actual Kaggle competition involved sentiment analysis of the last text column, “Description”, but here let’s just turn the description into a simpler feature: how many characters long it is.
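As a quick sanity check on that split (output not reproduced here; the skim() summary further down shows roughly 7.5k pets on each side after binarizing), you can count the raw AdoptionSpeed values:

# sanity check on class balance before binarizing: 0-2 vs 3-4 should be roughly even
pets %>%
  count(AdoptionSpeed)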
Most of those features are categorical, rather than numerical or text. Although many of them have a natural ordering, like “FurLength” (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified), it will be more useful to keep “Not Specified” as its own category and abandon the ordering of the other three, rather than treat them as ordered and be unable to use the “Not Specified” rows.
# convert 'Description' to 'TextLength' and 'Name' to binary 'Named' (yes/no)
numericals <- c('Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt', 'TextLength')
# convert 'AdoptionSpeed' into binary 'AdoptedFast' (yes/no)
target <- 'AdoptedFast'
categoricals <- c('Type', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Named',
'Dewormed', 'Sterilized', 'Health', 'State')
ids <- 'PetID'
First we need to create the three promised features: one indicating whether the pet is named or not, one giving the length of the text description, and one converting the target variable to indicate whether the pet was adopted within a month.
pets$Name[1:30]
## [1] "Nibble" "No Name Yet"
## [3] "Brisco" "Miko"
## [5] "Hunter" NA
## [7] "BULAT" "Siu Pak & Her 6 Puppies"
## [9] NA "Kitty"
## [11] "Bear" "Kali"
## [13] "Peanut" "2 Mths Old Cute Kitties"
## [15] "Lost Dog" "Max"
## [17] "Brownie" "Blackie"
## [19] "Beauty" NA
## [21] "Godiva" "Tigers"
## [23] "Kenit, Kenot, Techit, Keyad, Owen" "Donut"
## [25] "Cikenet" "Garfield"
## [27] "No Name" "No Name"
## [29] "Hunter" "Pepper"
Besides the NA’s, there are some “No Name” and “No Name Yet” listings, as well as some descriptive statements instead of names (“2 Mths Old Cute Kitties”, “Lost Dog”). Maybe the best thing to do is just take care of the NA’s and “No Name” type listings.
pets <- pets %>%
mutate(AdoptedFast = ifelse(AdoptionSpeed < 3, 'yes', 'no')) %>%
mutate(TextLength = str_length(Description)) %>%
mutate(Named = ifelse(is.na(Name), 'no',
ifelse(str_starts(Name, "No "), 'no', 'yes'))) %>%
select(-c('AdoptionSpeed', 'Description', 'Name', 'RescuerID'))
# removed rescuer ID in order to train on 90% fewer features
head(pets[rev(names(pets))])
## # A tibble: 6 x 23
## Named TextLength AdoptedFast PhotoAmt PetID VideoAmt State Fee Quantity
## <chr> <int> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 yes 359 yes 1 86e1089a3 0 41326 100 1
## 2 no 118 yes 2 6296e909a 0 41401 0 1
## 3 yes 393 no 7 3422e4906 0 41326 0 1
## 4 yes 146 yes 8 5842f1ff5 0 41401 150 1
## 5 yes 390 yes 3 850a43f90 0 41326 0 1
## 6 no 87 yes 2 d24c30b4b 0 41326 0 1
## # … with 14 more variables: Health <dbl>, Sterilized <dbl>, Dewormed <dbl>,
## # Vaccinated <dbl>, FurLength <dbl>, MaturitySize <dbl>, Color3 <dbl>,
## # Color2 <dbl>, Color1 <dbl>, Gender <dbl>, Breed2 <dbl>, Breed1 <dbl>,
## # Age <dbl>, Type <dbl>
Make all the categorical features, and the target, into factors.
pets[, c(categoricals, target)] <- lapply(pets[, c(categoricals, target)], factor)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
## ✓ broom 0.7.3 ✓ recipes 0.1.15
## ✓ dials 0.0.9 ✓ rsample 0.0.9
## ✓ infer 0.5.4 ✓ tune 0.1.3
## ✓ modeldata 0.1.0 ✓ workflows 0.2.2
## ✓ parsnip 0.1.5 ✓ yardstick 0.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x dplyr::collapse() masks glue::collapse()
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(skimr)
skim(pets, all_of(c(categoricals, target)))
| Data summary | |
|---|---|
| Name | pets |
| Number of rows | 14993 |
| Number of columns | 23 |
| Column type frequency: factor | 16 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Type | 0 | 1 | FALSE | 2 | 1: 8132, 2: 6861 |
| Breed1 | 0 | 1 | FALSE | 176 | 307: 5927, 266: 3634, 265: 1258, 299: 342 |
| Breed2 | 0 | 1 | FALSE | 135 | 0: 10762, 307: 1727, 266: 599, 265: 321 |
| Gender | 0 | 1 | FALSE | 3 | 2: 7277, 1: 5536, 3: 2180 |
| Color1 | 0 | 1 | FALSE | 7 | 1: 7427, 2: 3750, 3: 947, 5: 884 |
| Color2 | 0 | 1 | FALSE | 7 | 0: 4471, 7: 3438, 2: 3313, 5: 1128 |
| Color3 | 0 | 1 | FALSE | 6 | 0: 10604, 7: 3221, 5: 417, 6: 378 |
| MaturitySize | 0 | 1 | FALSE | 4 | 2: 10305, 1: 3395, 3: 1260, 4: 33 |
| FurLength | 0 | 1 | FALSE | 3 | 1: 8808, 2: 5361, 3: 824 |
| Vaccinated | 0 | 1 | FALSE | 3 | 2: 7227, 1: 5898, 3: 1868 |
| Named | 0 | 1 | FALSE | 2 | yes: 13644, no: 1349 |
| Dewormed | 0 | 1 | FALSE | 3 | 1: 8397, 2: 4815, 3: 1781 |
| Sterilized | 0 | 1 | FALSE | 3 | 2: 10077, 1: 3101, 3: 1815 |
| Health | 0 | 1 | FALSE | 3 | 1: 14478, 2: 481, 3: 34 |
| State | 0 | 1 | FALSE | 14 | 413: 8714, 414: 3845, 413: 843, 413: 507 |
| AdoptedFast | 0 | 1 | FALSE | 2 | yes: 7537, no: 7456 |
Before scaling the numeric features, the data should be split into training and testing groups. That way, when models are trained, they won’t be using any information from the test data.
# set the random seed, for reproducibility, and split the data 80/20.
set.seed(607)
traintest <- initial_split(pets, prop = .80)
train_data <- training(traintest)
test_data <- testing(traintest)
Now the numerical data can be scaled, using a recipe. Some models, like decision trees, don’t need scaled inputs, but most others perform better with them. In the same recipe, you can one-hot encode the categorical variables, so that models can train on 1’s and 0’s, as they like to do; this turns one column of many values into many columns of binary values. Since the breed features, for example, include over 300 breeds, which now become over 300 columns, some of those columns may occur only once in the training set, or not at all there and only once in the testing set, which makes for bad predictions and erroneous learning. So step_zv (for “zero variance”) removes these problematic columns. You can also give the PetID column an identifier role, so that models ignore it but it stays around for bookkeeping, if needed.
rec <- recipe(AdoptedFast ~ ., data = train_data) %>%
update_role(all_of(ids), new_role = "ID") %>%
step_normalize(all_of(numericals)) %>%
step_dummy(all_of(categoricals)) %>%
step_zv(all_predictors())
# prep() fits the scaler to the training data
scaler <- prep(rec, training = train_data)
# and bake() transforms all data using the statistics learned by prep()
scaled_train <- bake(scaler, train_data)
scaled_test <- bake(scaler, test_data)
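As an optional check (not part of the original pipeline), the prepped recipe can report which columns step_zv() actually dropped; since step_zv() is the third step in the recipe above, it is number = 3 in tidy(). The dim() call just shows how wide the processed data ends up.

# which zero-variance columns were removed by step 3, and how wide is the result?
tidy(scaler, number = 3)   # tibble listing the removed terms
dim(scaled_train)          # rows and columns after normalizing, dummy coding, and zv removal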
Now the scaled data can be used to train models. Or at least, that shows conceptually how the scaler is fit to the data and then transforms it. But instead of exiting the pipeline so soon, a recipe can sit inside a larger pipeline called a workflow(), which manages all the preprocessing steps as part of model fitting and prediction. It just needs the data, the recipe, and a model.
One natural model choice to start with might be a logistic regression classifier, such as logistic_reg() from the parsnip package.
lr_mod <- logistic_reg() %>%
set_engine('glm') # barebones log_reg, without regularization penalties
pet_workflow <- workflow() %>%
add_model(lr_mod) %>%
add_recipe(rec)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Fit the model to the training set.
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 339 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.00695 1.62 -0.00428 9.97e- 1
## 2 Age -0.304 0.0298 -10.2 1.91e-24
## 3 Quantity -0.0902 0.0279 -3.23 1.24e- 3
## 4 Fee -0.110 0.0242 -4.53 5.87e- 6
## 5 VideoAmt -0.00518 0.0207 -0.251 8.02e- 1
## 6 PhotoAmt 0.00310 0.0215 0.144 8.85e- 1
## 7 TextLength 0.0326 0.0213 1.53 1.26e- 1
## 8 Type_X2 -1.90 1.26 -1.50 1.34e- 1
## 9 Breed1_X1 -16.2 879. -0.0185 9.85e- 1
## 10 Breed1_X3 -17.4 1455. -0.0120 9.90e- 1
## # … with 329 more rows
We can ask the logistic regression model for predicted probabilities, rather than just binary ‘yes’/‘no’ predictions; that way we can use the area under the ROC curve to evaluate the model over all choices of threshold, rather than just 0.50.
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# bound the true values for visual inspection of predictions
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.493 0.507 yes
## 2 0.440 0.560 no
## 3 0.556 0.444 no
## 4 0.698 0.302 no
## 5 0.659 0.341 no
## 6 0.827 0.173 no
## 7 0.494 0.506 yes
## 8 0.689 0.311 yes
## 9 0.179 0.821 no
## 10 0.460 0.540 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.658
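For comparison with that threshold-free view, here is what the same model looks like at just the default 0.50 cutoff. This is a supplementary check rather than part of the original analysis, using hard class predictions with yardstick's accuracy() and conf_mat().

# hard 'yes'/'no' predictions at the 0.50 cutoff, scored with accuracy and a confusion matrix
class_pred <- predict(pet_fit, test_data) %>%
  bind_cols(test_data %>% select(AdoptedFast))
accuracy(class_pred, truth = AdoptedFast, estimate = .pred_class)
conf_mat(class_pred, truth = AdoptedFast, estimate = .pred_class)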
Not a great area under the curve (0.658), but better than guessing (0.5). Adding some regularization penalties to the parameters would probably help. For this, you can use the same model, but run it with an ElasticNet engine (‘glmnet’ instead of the ‘glm’ used above). ElasticNet lets you specify what proportion of L2/L1 penalties to apply to the weights learned by the model. L2 penalizes large weights more heavily, while L1 penalizes small weights relatively more, tending to push them all the way to zero.
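Concretely, the penalty glmnet adds to the loss looks like the sketch below (parsnip's penalty maps to glmnet's lambda and mixture to its alpha; the function here is purely illustrative, not something the pipeline needs).

# illustrative only: the elastic-net penalty added to the negative log-likelihood
elastic_penalty <- function(beta, penalty = 0.01, mixture = 0.2) {
  penalty * (mixture * sum(abs(beta)) +           # L1 part: hits small weights relatively hard, zeroing some out
             (1 - mixture) / 2 * sum(beta ^ 2))   # L2 part: hits large weights hardest
}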
Because we built this into a modular workflow(), we can just change the model part of things and run the pipeline easily.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-1
# instantiate a model with 20% L1 and 80% L2, and using a penalty of 0.01 (somewhat arbitrarily)
lr_elastic <- logistic_reg(penalty = 0.01, mixture = 0.2) %>%
set_engine('glmnet')
pet_workflow <- pet_workflow %>%
update_model(lr_elastic)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 0.2
##
## Computational engine: glmnet
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 339 x 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) -0.408 0.01
## 2 Age -0.230 0.01
## 3 Quantity -0.0813 0.01
## 4 Fee -0.0626 0.01
## 5 VideoAmt 0 0.01
## 6 PhotoAmt 0 0.01
## 7 TextLength 0.0205 0.01
## 8 Type_X2 -0.0283 0.01
## 9 Breed1_X1 -1.38 0.01
## 10 Breed1_X3 -1.80 0.01
## # … with 329 more rows
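A quick supplementary check of how aggressive the shrinkage was: count how many coefficients the elastic net set exactly to zero (VideoAmt and PhotoAmt above are two of them).

# how many of the terms were shrunk to exactly zero by the L1 part of the penalty?
pet_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  summarise(n_terms = n(), n_zeroed = sum(estimate == 0))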
The learned weights (the estimate column above) are smaller than in the unpenalized model, with some shrunk all the way to zero, though the intercept is larger. This model should overfit less, since its parameters are smaller, but it will be more biased. Let’s compare how it predicts versus the first one.
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.496 0.504 yes
## 2 0.475 0.525 no
## 3 0.552 0.448 no
## 4 0.680 0.320 no
## 5 0.632 0.368 no
## 6 0.791 0.209 no
## 7 0.497 0.503 yes
## 8 0.591 0.409 yes
## 9 0.239 0.761 no
## 10 0.466 0.534 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.663
That’s barely any better than the first model. One other ML model that usually gets thrown at this type of data and task is a random forest. It handles categorical data well, and it deals with overfitting by averaging many trees, each of which only sees a random subset of the features at any given split, so the model can’t rely too heavily on any one feature. Again we can just substitute it into the workflow and give it a whirl.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_mod <- rand_forest(mode = 'classification', mtry = 20,
min_n = 3, trees = 100) %>%
set_engine('randomForest')
pet_workflow <- pet_workflow %>%
update_model(rf_mod)
pet_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## ● step_normalize()
## ● step_dummy()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 20
## trees = 100
## min_n = 3
##
## Computational engine: randomForest
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.48 0.52 yes
## 2 0.46 0.54 no
## 3 0.570 0.43 no
## 4 0.46 0.54 no
## 5 0.67 0.33 no
## 6 0.92 0.08 no
## 7 0.43 0.570 yes
## 8 0.46 0.54 yes
## 9 0.38 0.62 no
## 10 0.34 0.66 yes
## # … with 2,988 more rows
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
That curve looks better. The logistic regression models had ROC-AUC scores of about 0.66, so let’s see how much better the random forest did.
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.707
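As a side note on the earlier point about the forest not relying too heavily on any one feature, the underlying randomForest object exposes a Gini-based importance measure. This is a supplementary peek (rf_obj is just a throwaway name here); without importance = TRUE passed through set_engine(), only MeanDecreaseGini is available.

# top 10 features by mean decrease in Gini impurity
rf_obj <- pull_workflow_fit(pet_fit)$fit
imp <- rf_obj$importance
head(imp[order(-imp[, "MeanDecreaseGini"]), , drop = FALSE], 10)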
An AUC of 0.707 is better, but still not great. The proper thing to do now is to find better hyperparameters for the random forest model. I just guessed at mtry = 20, which sets how many features are randomly sampled as split candidates at each node, and min_n = 3, which stops splitting a node once it holds fewer than 3 examples. And to make the model train faster, I only used trees = 100; predictive accuracy would likely improve with 500 or 1000 trees.
rf2_mod <- rand_forest(mode = 'classification', mtry = 30,
min_n = 4, trees = 200) %>%
set_engine('randomForest')
pet_workflow <- pet_workflow %>%
update_model(rf2_mod)
pet_fit <- pet_workflow %>%
fit(data = train_data)
pet_pred <-
predict(pet_fit, test_data, type = "prob") %>%
bind_cols(test_data %>% select(AdoptedFast))
pet_pred
## # A tibble: 2,998 x 3
## .pred_no .pred_yes AdoptedFast
## <dbl> <dbl> <fct>
## 1 0.495 0.505 yes
## 2 0.425 0.575 no
## 3 0.575 0.425 no
## 4 0.36 0.64 no
## 5 0.575 0.425 no
## 6 0.835 0.165 no
## 7 0.4 0.6 yes
## 8 0.36 0.64 yes
## 9 0.335 0.665 no
## 10 0.355 0.645 yes
## # … with 2,988 more rows
The notable thing about those predictions is that they are more “certain” of themselves, i.e. they generally sit further away from 0.50.
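One rough way to quantify that (again, just a supplementary summary, not part of the original post) is the average distance of the predicted probabilities from the 0.50 boundary:

# average distance of predictions from the 0.50 decision boundary
pet_pred %>%
  summarise(mean_dist_from_half = mean(abs(.pred_yes - 0.5)))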
pet_pred %>%
roc_curve(truth = AdoptedFast, .pred_no) %>%
autoplot()
pet_pred %>%
roc_auc(truth = AdoptedFast, .pred_no)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.717
Some ways to improve from here:

- Use cross-fold validation with a hyperparameter grid search to find the best combination of hyperparameters, if the goal is to squeeze better predictions out of the model (a sketch follows this list).
- Try different ML models.
- Use NLP to analyze the text descriptions of the pets, rather than just using the length of the text as a (pretty useless) feature.
- Use the photos of the pets (not included here, but available on Kaggle) as extra information for a neural network.
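A minimal sketch of what that first item could look like with tune(), vfold_cv(), and tune_grid(); the grid ranges, fold count, and tree count below are illustrative guesses, not tuned values from this post.

# mark mtry and min_n as tunable; use a fixed, larger number of trees
rf_tune <- rand_forest(mode = 'classification', mtry = tune(),
                       min_n = tune(), trees = 500) %>%
  set_engine('randomForest')
tune_workflow <- pet_workflow %>%
  update_model(rf_tune)
# 5-fold cross-validation on the training data, and a small regular grid of candidates
folds <- vfold_cv(train_data, v = 5)
rf_grid <- grid_regular(mtry(range = c(10, 60)),
                        min_n(range = c(2, 10)),
                        levels = 3)
tune_results <- tune_grid(tune_workflow, resamples = folds, grid = rf_grid,
                          metrics = metric_set(roc_auc))
show_best(tune_results, metric = "roc_auc")
# refit on all the training data with the best combination, then score the test set
best_params <- select_best(tune_results, metric = "roc_auc")
final_fit <- finalize_workflow(tune_workflow, best_params) %>%
  fit(data = train_data)
predict(final_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(AdoptedFast)) %>%
  roc_auc(truth = AdoptedFast, .pred_no)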
- The data on Kaggle: https://www.kaggle.com/c/petfinder-adoption-prediction/data
- A blog post by Paul Sharkey that helped me
- A helpful tidymodels.org page