DATA621: Final Project

Overview

There are millions of stray pets around the world, some of which are fortunate enough to be adopted while many others are not. While adoption of a pet is often the definition of success, the rate at which a pet is adopted is also a key success factor - pets that take a long time to adopt contribute to over-crowded animal shelters and can prevent taking on new strays. Sadly, pets that are not adopted eventually need to be euthanized.

Learn more about the data

About the data

PetID - Unique hash ID of pet profile
AdoptionSpeed - response variable Categorical speed of adoption. Lower is faster.
Type - Type of animal (1 = Dog, 2 = Cat)
Name - Name of pet (Empty if not named)
Age - Age of pet when listed, in months
Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
Quantity - Number of pets represented in profile
Fee - Adoption fee (0 = Free)
State - State location in Malaysia (Refer to StateLabels dictionary)
RescuerID - Unique hash ID of rescuer
VideoAmt - Total uploaded videos for this pet
PhotoAmt - Total uploaded photos for this pet
Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

What to Predict

Predictor (Adoption Speed) Description: Predict how quickly, if at all, a pet is adopted.

The values are determined in the following way: 0 - Pet was adopted on the same day as it was listed. 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 4 - No adoption after 100 days of being listed.

This Notebook…

The data has no missing values, but there are a number of features in text that need to be converted to some numeric value. This notebook performs those changes.

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)

library(dplyr)
library(reshape)
library(ggplot2)
library(purrr)
library(psych)
library(tidyr)
library(corrplot)
library(forcats)
library(kableExtra)
library(stats)
library(GGally)
library(wordcloud)
library(caret )

Load Data

Load the data

##   Type                  Name Age Breed1 Breed2 Gender Color1 Color2 Color3
## 1    1              Lil Milo   2      0     26      2      2      0      0
## 2    1 Bella 4 Months Puppy!   4      0    307      2      2      3      0
##   MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee
## 1            2         1          1        1          2      1        1   0
## 2            2         1          1        1          2      1        1 100
##   State                        RescuerID VideoAmt
## 1 41326 1a2113010d6048d5410b265347b35c91        0
## 2 41326 3673e167fc9932b13149bed1f2a0180a        0
##                                                                                                                                                                                                                             Description
## 1                                              Milo went missing after a week with her new adoptive family. Only 3 months old, light brown coat. Missing from Jalan Kiara, Bandar Botanic, Klang. Please call Su at if you've seen her.
## 2 She's only 4 months old, very friendly and loving. Loves attention. A little naughty sometimes. But she's adorable. I adopted her from MDDB, but recently I have just moved to a condo. Im finding a perfect and loving home for her.
##       PetID PhotoAmt AdoptionSpeed breedname
## 1 375905770        3             3      <NA>
## 2 da8d4a273        5             4      <NA>

Missing Values Count

There are no missing values

map(data, ~sum(is.na(.))) %>% t()

##      Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize
## [1,] 0    0    0   0      0      0      0      0      0      0           
##      FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State
## [1,] 0         0          0        0          0      0        0   0    
##      RescuerID VideoAmt Description PetID PhotoAmt AdoptionSpeed breedname
## [1,] 0         0        0           0     0        0             5

Show the type of each data

str(data)

## 'data.frame':    14993 obs. of  25 variables:
##  $ Type         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Name         : chr  "Lil Milo" "Bella 4 Months Puppy!" "" "\"Boy Boy\"" ...
##  $ Age          : int  2 4 3 72 2 5 24 3 0 24 ...
##  $ Breed1       : int  0 0 0 0 0 1 1 3 5 5 ...
##  $ Breed2       : int  26 307 266 307 205 0 0 0 0 307 ...
##  $ Gender       : int  2 2 3 1 2 2 3 1 2 2 ...
##  $ Color1       : int  2 2 1 1 2 1 4 2 1 3 ...
##  $ Color2       : int  0 3 4 2 5 4 0 0 2 5 ...
##  $ Color3       : int  0 0 7 0 7 7 0 0 0 0 ...
##  $ MaturitySize : int  2 2 1 2 1 2 2 2 1 2 ...
##  $ FurLength    : int  1 1 1 2 1 1 1 2 1 2 ...
##  $ Vaccinated   : int  1 1 2 2 2 2 1 2 2 1 ...
##  $ Dewormed     : int  1 1 1 2 2 2 1 1 2 1 ...
##  $ Sterilized   : int  2 2 2 2 2 2 1 2 2 1 ...
##  $ Health       : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ Quantity     : int  1 1 3 1 1 2 2 4 1 1 ...
##  $ Fee          : int  0 100 0 0 1 0 0 0 0 0 ...
##  $ State        : int  41326 41326 41401 41326 41336 41326 41330 41326 41401 41326 ...
##  $ RescuerID    : chr  "1a2113010d6048d5410b265347b35c91" "3673e167fc9932b13149bed1f2a0180a" "f7cff59d10c867bdee12c3f35f34d086" "94b991f8dc1e0bb903ca8d4d492c8d43" ...
##  $ VideoAmt     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Description  : chr  "Milo went missing after a week with her new adoptive family. Only 3 months old, light brown coat. Missing from "| __truncated__ "She's only 4 months old, very friendly and loving. Loves attention. A little naughty sometimes. But she's adora"| __truncated__ "Mama cat came to house and gave birth to these 03 lovely kittens, please adopt them and give them a home sweet home." "He is a stray dog found wandering around University Putra Malaysia (UPM), Serdang main campus. I have been told"| __truncated__ ...
##  $ PetID        : chr  "375905770" "da8d4a273" "27e74e45c" "7b5bee232" ...
##  $ PhotoAmt     : int  3 5 11 5 0 2 5 3 2 2 ...
##  $ AdoptionSpeed: int  3 4 2 4 3 4 4 4 1 4 ...
##  $ breedname    : chr  NA NA NA NA ...

New features

AgeYears - provided in months this is too many variations. Decided to convert to new feature of AgeYears and drop Age

data$AgeYears = round(data$Age/12, 0)

NumColors - derived by counting the number of colors a pet has using the Color1,2, and 3 features - keep these features, but given there are 7 color categories the value of this feature is questionable in doubt - create a NumColors feature with a 1,2, or 3 respectively

num_colors_vec = c()
for (i in 1:nrow(data)){
  ncolors = 1
  
  color2 <- data[i,8]
  color3 <- data[i,9]
  
  if (color2 > 0){ncolors = ncolors + 1}
  if (color3 > 0){ncolors = ncolors + 1}
  
  num_colors_vec <- c(num_colors_vec, ncolors)
}

data$NumColors = num_colors_vec

AllMeds = the sum of Vaccinated, Dewormed, and Sterilized. Combining these into one feature could potentially reduce multi-collinearity

data$AllMeds = data$Vaccinated + data$Dewormed + data$Sterilized

Remove features

Remove the following features… Name - this is a text field and new owners can (and usually do) rename their pets, so removing this feature Age - replaced with AgeYears (see above) Breed2 - 10,700 of almost 15,000 rows are populated with 0 (unknown), so this doesn’t seem like a good feature to keep State - initially kept this, but the correlation to AdoptionSpeed is only about 2%. Decided to remove it RescuerID - common sense is that this field will not have any predictive value for adoption rate Description - this will have value in future analysis for NLP, but this will be evaluated differently in another notebook PetID - same reason as RescuerID

data = subset(data, select = -c(Name, Age, Breed2, State, RescuerID, Description, PetID))

Show the updated data type

sapply(data,class)

##          Type        Breed1        Gender        Color1        Color2 
##     "integer"     "integer"     "integer"     "integer"     "integer" 
##        Color3  MaturitySize     FurLength    Vaccinated      Dewormed 
##     "integer"     "integer"     "integer"     "integer"     "integer" 
##    Sterilized        Health      Quantity           Fee      VideoAmt 
##     "integer"     "integer"     "integer"     "integer"     "integer" 
##      PhotoAmt AdoptionSpeed     breedname      AgeYears     NumColors 
##     "integer"     "integer"   "character"     "numeric"     "numeric" 
##       AllMeds 
##     "integer"

#Change categorical variables from integer to factors

data <- transform(
  data,
  Type=as.factor(Type),
  Breed1=as.factor(Breed1),
  Gender=as.factor(Gender),
  Color1=as.factor(Color1),
  Color2=as.factor(Color2),
  Color3=as.factor(Color3),
  MaturitySize=as.factor(MaturitySize),
  FurLength=as.factor(FurLength),
  Vaccinated = as.factor(Vaccinated),
  Dewormed = as.factor(Dewormed),
  Sterilized = as.factor(Sterilized),
  Health = as.factor(Health),
  AdoptionSpeed = as.factor(AdoptionSpeed),
  NumColors = as.factor(NumColors)
)

str(data)

## 'data.frame':    14993 obs. of  21 variables:
##  $ Type         : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Breed1       : Factor w/ 176 levels "0","1","3","5",..: 1 1 1 1 1 2 2 3 4 4 ...
##  $ Gender       : Factor w/ 3 levels "1","2","3": 2 2 3 1 2 2 3 1 2 2 ...
##  $ Color1       : Factor w/ 7 levels "1","2","3","4",..: 2 2 1 1 2 1 4 2 1 3 ...
##  $ Color2       : Factor w/ 7 levels "0","2","3","4",..: 1 3 4 2 5 4 1 1 2 5 ...
##  $ Color3       : Factor w/ 6 levels "0","3","4","5",..: 1 1 6 1 6 6 1 1 1 1 ...
##  $ MaturitySize : Factor w/ 4 levels "1","2","3","4": 2 2 1 2 1 2 2 2 1 2 ...
##  $ FurLength    : Factor w/ 3 levels "1","2","3": 1 1 1 2 1 1 1 2 1 2 ...
##  $ Vaccinated   : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 2 1 2 2 1 ...
##  $ Dewormed     : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 1 1 2 1 ...
##  $ Sterilized   : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 1 2 2 1 ...
##  $ Health       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Quantity     : int  1 1 3 1 1 2 2 4 1 1 ...
##  $ Fee          : int  0 100 0 0 1 0 0 0 0 0 ...
##  $ VideoAmt     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PhotoAmt     : int  3 5 11 5 0 2 5 3 2 2 ...
##  $ AdoptionSpeed: Factor w/ 5 levels "0","1","2","3",..: 4 5 3 5 4 5 5 5 2 5 ...
##  $ breedname    : chr  NA NA NA NA ...
##  $ AgeYears     : num  0 0 0 6 0 0 2 0 0 2 ...
##  $ NumColors    : Factor w/ 3 levels "1","2","3": 1 2 3 2 3 3 1 1 2 2 ...
##  $ AllMeds      : int  4 4 5 6 6 6 3 5 6 3 ...

Add Labels for Colors

Import the color labels to see the names of the colors.

colors <- read.csv('./data/color_labels.csv')

glimpse(colors)

## Rows: 7
## Columns: 2
## $ ColorID   <int> 1, 2, 3, 4, 5, 6, 7
## $ ColorName <chr> "Black", "Brown", "Golden", "Yellow", "Cream", "Gray", "Whit…

Checking the levels of Color1, Color2, and Color3

levels(data$Color1)

## [1] "1" "2" "3" "4" "5" "6" "7"

levels(data$Color2)

## [1] "0" "2" "3" "4" "5" "6" "7"

levels(data$Color3)

## [1] "0" "3" "4" "5" "6" "7"

Creating new variables to name the colors.

data$ColorName1 <- recode_factor(data$Color1,"1"= "Black", 
                                "2"= "Brown", 
                                "3"= "Golden", 
                                "4"= "Yellow", 
                                "5"= "Cream", 
                                "6"= "Gray", 
                                "7"= "White")

data$ColorName2 <- recode_factor(data$Color2, "0" = "NA",
                                "2"= "Brown", 
                                "3"= "Golden", 
                                "4"= "Yellow", 
                                "5"= "Cream", 
                                "6"= "Gray", 
                                "7"= "White")

data$ColorName3 <- recode_factor(data$Color3, "0" = "NA",
                                "3"= "Golden", 
                                "4"= "Yellow", 
                                "5"= "Cream", 
                                "6"= "Gray", 
                                "7"= "White")

Converting the char version of “NA” introduced in ColorNames 1-3 to type NA.

data[data=="NA"]<-NA

Dog Breed Data woof

Lets check how many dogs and cats we have:

table(data$Type)

## 
##    1    2 
## 8132 6861

It looks like we have 1262 more dogs than cats, with 8132 dogs and 6861 cats.

Subset data by Type to focus on dogs.

dog_data = filter(data, Type == 1)
dog_data = subset(dog_data, select = -c(Type))

Dog (data) Wrangling

dog_data$breedname <- gsub("Amer ", "American ", dog_data$breedname)

Next we will collapse duplicate Labrador Retriever factors (Black, Chocolate, and Yellow) into one as they are not considered to be separate Retriever breeds.

dog_data$breedname <- fct_collapse(dog_data$breedname, 
                                   'Labrador Retriever' = 
                                     c("Black Labrador Retriever",
                                       "Chocolate Labrador Retriever",
                                       "Yellow Labrador Retriever"))

Below are the top 10 breeds where Mixed Breed has a heavy majority with more than half of the dogs being categorized as a mix.

table(dog_data$breedname) %>% 
  as.data.frame() %>% 
  arrange(desc(Freq)) %>% 
  top_n(10)

##                   Var1 Freq
## 1          Mixed Breed 5923
## 2   Labrador Retriever  226
## 3             Shih Tzu  189
## 4               Poodle  167
## 5              Terrier  161
## 6     Golden Retriever  151
## 7  German Shepherd Dog   98
## 8               Beagle   90
## 9                Spitz   89
## 10          Rottweiler   88

  #subset(Freq!=0)

Size Label by Breed

When deciding to adopt a new pet a consideration many people have is the size of the dog for a number of reasons. Size plays an important role in a dog’s strength and activity level which is a concern for a potential new owner, in addition to space in ones house or if they have a backyard or not.

Next we bring in the AKC Breed Info table to load details on dog sizes by breed.

breed_size <- read.csv('./data/TrainingData/AKC Breed Info.csv')

str(breed_size)

## 'data.frame':    150 obs. of  5 variables:
##  $ BreedName         : chr  "Akita" "Anatolian Sheepdog" "Bernese Mountain Dog" "Bloodhound" ...
##  $ height_low_inches : chr  "26" "27" "23" "24" ...
##  $ height_high_inches: chr  "28" "29" "27" "26" ...
##  $ weight_low_lbs    : int  80 100 85 80 70 100 120 95 130 90 ...
##  $ weight_high_lbs   : int  120 150 110 120 100 130 160 120 150 150 ...

Remove one Fox Terrier of the two entries. Both entries have the same size data, only difference is hair type (smooth vs. wire).

breed_size <- breed_size %>% 
  slice(-116)

breed_size$BreedName <- recode(breed_size$BreedName, 
                               "Collie (Rough) & (Smooth)" = "Collie",
                               "Fox Terrier ‰ÛÒ Wirehair" = "Fox Terrier")

Using the breed_size table, a feature can be created using the American Kennel Club Breeds by Size breakdown.

Alternatively, here is another breed type breakdown by size.

breed_size$avg_weight <- rowMeans(breed_size[,c('weight_low_lbs', 
                                                'weight_high_lbs')], na.rm=TRUE)

Since there are two types of Cocker Spaniels (American and English) in our breed_size table, and our dog_data has Cocker Spaniel, I took the average of both the American and English to give an idea of what a general Cocker Spaniel might weigh to then be able to classify its size.

#[row,column]
breed_size$height_low_inches <- as.integer(breed_size$height_low_inches)
breed_size$height_high_inches <- as.integer(breed_size$height_high_inches)

#cocker spaniel avg
cs <- sapply(breed_size[c(68,69),],mean)
breed_size<-rbind(breed_size, cs)

breed_size[150,1] = "Cocker Spaniel"
breed_size[68,1] ="American Cocker Spaniel"
breed_size[69,1] ="English Cocker Spaniel"

#poodle avg
poodle <- sapply(breed_size[c(122,48,147),],mean)
breed_size<-rbind(breed_size, poodle)
breed_size[151,1] = "Poodle"

Add feature

Create a size feature based on breed using case_when()¹.

breed_size<- breed_size %>% 
  mutate(
    size = case_when(
      (avg_weight >= 3 & avg_weight < 9) ~ "toy",
      (avg_weight >= 9 & avg_weight < 35) ~ "small",
      (avg_weight >= 35 & avg_weight < 65) ~ "medium",
      (avg_weight >= 65 & avg_weight < 85) ~ "large",
      (avg_weight >= 85) ~ "giant"
    )
  )

Now we will left_join() ² our dog_data with our breed_size to bring in the size feature by breed.

dog_data_merge <- left_join(dog_data, select(breed_size, c(1,7)), 
                            by=c("breedname" = "BreedName"))

dog_data_merge$size <- as.factor(dog_data_merge$size)

table(dog_data_merge$size) %>% 
  as.data.frame() %>% 
  arrange(desc(Freq))

##     Var1 Freq
## 1  small  608
## 2  large  557
## 3 medium  140
## 4  giant  119
## 5    toy   84

Hypoallergenic

What is a hypoallergenic dog?

According to the American College of Allergy, Asthma, and Immunology, as much as 10% of the U.S. population is allergic to dogs. While there are no 100% hypoallergenic dogs, there are many breeds that do well with allergy sufferers. Dander, which is attached to pet hair, is what causes most pet allergies in humans and these dogs have a non-shedding coat that produces less dander.

So if you are either part of the 10% of the U.s. population that is allergic to dogs or just don’t like to have to vacuum dog hair twice a day then a hypoallergenic dog will be more appealing to you than a non-hypoallergenic dog.

Add feature

Below we create a dataframe using the list of hypoallergenic dogs from the American Kennel Club. Once created, we left_join() it to our dog_data_merge dataframe to add the new feature for analysis.

breed <- c("Affenpinscher", "Afgan Hound", "American Hairless Terrier", 
                    "Barbado da Terceira", "Bedlington Terrier", "Bichon Frise",
                    "Bolongese", "Chinese Crested", "Coton de Tulear", 
                    "Giant Schnauzer", "Irish Water Spaniel", 
                    "Kerry Blue Terrier","Lagotto Romangnolo", "Lowchen", 
                    "Maltese","Miniature Shnauzer","Peruvian Inca Orchid", 
                    "Poodle", "Standard Poodle", "Schnauzer", 
                    "Portuguese Water Dog", "Russian Tsvetnaya Bolonka", 
                    "Wheaten Terrier","Xoloitzcuintli", "Yorkshire Terrier"
                    )
hypoallergenic <- as.data.frame(breed)
hypoallergenic$hypoallergenic <- "Yes"

dog_data_merge1 <- left_join(dog_data_merge, hypoallergenic, 
                             by=c("breedname" = "breed"))

#when breedname = Mixed Breed & NA hypo = NA, all else should be "No" 
dog_data_merge1$hypoallergenic[is.na(dog_data_merge1$hypoallergenic)] <- "No"

dog_data_merge1$hypoallergenic[which(dog_data_merge1$breedname == "Mixed Breed")] <- NA
dog_data_merge1$hypoallergenic[which(is.na(dog_data_merge1$breedname))] <- NA

dog_data_merge1$hypoallergenic <- as.factor(dog_data_merge1$hypoallergenic)

How many hypoallergenic and non-hypoallergenic do we have?

table(dog_data_merge1$hypoallergenic)

## 
##   No  Yes 
## 1942  263

sum(table(dog_data_merge1$hypoallergenic))

## [1] 2205

Mixed Breeds and breeds that are NA in the breedname column will be NA for hypoallergenic as it logically can’t be confirmed if they are hypoallergenic or not. In general, there are Mixed Breeds that can be considered hypoallergenic. For example, a dog that is part Poodle (hypoallergenic) and part cavalier King Charles Spaniel (non-hypoallergenic) can be considered hypoallergenic, the resulting pup would is often called Cavapoo or Cavoodle.

How many NA breed do we have?

table(is.na(dog_data_merge1$breedname))

## 
## FALSE  TRUE 
##  8128     4

How many are Mixed Breed?

table(dog_data_merge1$breedname=="Mixed Breed")

## 
## FALSE  TRUE 
##  2205  5923

Distributions of Numeric Variables

Here we check our data with the added features of size and hypoallergenic.

str(dog_data_merge1)

## 'data.frame':    8132 obs. of  25 variables:
##  $ Breed1        : Factor w/ 176 levels "0","1","3","5",..: 1 1 1 1 2 2 3 4 4 5 ...
##  $ Gender        : Factor w/ 3 levels "1","2","3": 2 2 1 2 2 3 1 2 2 1 ...
##  $ Color1        : Factor w/ 7 levels "1","2","3","4",..: 2 2 1 2 1 4 2 1 3 2 ...
##  $ Color2        : Factor w/ 7 levels "0","2","3","4",..: 1 3 2 5 4 1 1 2 5 1 ...
##  $ Color3        : Factor w/ 6 levels "0","3","4","5",..: 1 1 1 6 6 1 1 1 1 1 ...
##  $ MaturitySize  : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 1 2 2 ...
##  $ FurLength     : Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 1 ...
##  $ Vaccinated    : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 2 2 1 1 ...
##  $ Dewormed      : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 1 2 1 1 ...
##  $ Sterilized    : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 2 2 1 3 ...
##  $ Health        : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 1 1 ...
##  $ Quantity      : int  1 1 1 1 2 2 4 1 1 1 ...
##  $ Fee           : int  0 100 0 1 0 0 0 0 0 500 ...
##  $ VideoAmt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PhotoAmt      : int  3 5 5 0 2 5 3 2 2 2 ...
##  $ AdoptionSpeed : Factor w/ 5 levels "0","1","2","3",..: 4 5 5 4 5 5 5 2 5 5 ...
##  $ breedname     : chr  NA NA NA NA ...
##  $ AgeYears      : num  0 0 6 0 0 2 0 0 2 1 ...
##  $ NumColors     : Factor w/ 3 levels "1","2","3": 1 2 2 3 3 1 1 2 2 1 ...
##  $ AllMeds       : int  4 4 6 6 6 3 5 6 3 5 ...
##  $ ColorName1    : Factor w/ 7 levels "Black","Brown",..: 2 2 1 2 1 4 2 1 3 2 ...
##  $ ColorName2    : Factor w/ 7 levels "NA","Brown","Golden",..: NA 3 2 5 4 NA NA 2 5 NA ...
##  $ ColorName3    : Factor w/ 6 levels "NA","Golden",..: NA NA NA 6 6 NA NA NA NA NA ...
##  $ size          : Factor w/ 5 levels "giant","large",..: NA NA NA NA 4 4 NA 1 1 NA ...
##  $ hypoallergenic: Factor w/ 2 levels "No","Yes": NA NA NA NA 2 2 1 1 1 1 ...

Converting the breedname and AllMeds variables to factor.

dog_data_merge1$breedname <- as.factor(dog_data_merge1$breedname)

dog_data_merge1$AllMeds <- as.factor(dog_data_merge1$AllMeds)

Here we are subsetting our numeric predictor variables to check their distributions.

#subset our 3 vars
distributions <- dog_data_merge1 %>% 
  keep(is.numeric)

Next we check if the variables contain any NA values.

colSums(is.na(distributions))

## Quantity      Fee VideoAmt PhotoAmt AgeYears 
##        0        0        0        0        0

#plot
distributions %>% 
  gather(variable, value, 1:5) %>%
  ggplot(aes(value)) +
    facet_wrap(~variable, scales = "free") +
    geom_density(fill = "steelblue", alpha=0.9, color="steelblue") +
    geom_histogram(aes(y=..density..), alpha=0.2, fill = "lightblue", color="lightblue", position="identity") +
    theme_minimal()

#box plots
distributions %>% 
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot(fill = "steelblue", color="black", outlier.colour="red", outlier.shape=16,
             outlier.size=2, notch=FALSE) +
  theme_minimal()

head(dog_data_merge1)

##   Breed1 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed
## 1      0      2      2      0      0            2         1          1        1
## 2      0      2      2      3      0            2         1          1        1
## 3      0      1      1      2      0            2         2          2        2
## 4      0      2      2      5      7            1         1          2        2
## 5      1      2      1      4      7            2         1          2        2
## 6      1      3      4      0      0            2         1          1        1
##   Sterilized Health Quantity Fee VideoAmt PhotoAmt AdoptionSpeed     breedname
## 1          2      1        1   0        0        3             3          <NA>
## 2          2      1        1 100        0        5             4          <NA>
## 3          2      1        1   0        0        5             4          <NA>
## 4          2      1        1   1        0        0             3          <NA>
## 5          2      1        2   0        0        2             4 Affenpinscher
## 6          1      1        2   0        0        5             4 Affenpinscher
##   AgeYears NumColors AllMeds ColorName1 ColorName2 ColorName3  size
## 1        0         1       4      Brown       <NA>       <NA>  <NA>
## 2        0         2       4      Brown     Golden       <NA>  <NA>
## 3        6         2       6      Black      Brown       <NA>  <NA>
## 4        0         3       6      Brown      Cream      White  <NA>
## 5        0         3       6      Black     Yellow      White small
## 6        2         1       3     Yellow       <NA>       <NA> small
##   hypoallergenic
## 1           <NA>
## 2           <NA>
## 3           <NA>
## 4           <NA>
## 5            Yes
## 6            Yes

Transformations of numeric variables Next we will take transformations of the variables reviewed above to see if the transformed variables are worth looking into and using for our model.

Log Transformations

Hide

log_distributions <- log(distributions + 1)

# Histograms of log transformed numeric variables
log_distributions %>%
  gather(variable, value, 1:5) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "steelblue", color="steelblue") + 
  facet_wrap(~variable, scales ="free", ncol = 3) +
  labs(x = element_blank(), y = element_blank()) +
  theme_minimal()

Square Root transformations

sqrt_distributions <- sqrt(distributions)

sqrt_distributions %>% 
  gather(variable, value, 1:5) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "steelblue", color="steelblue") + 
  facet_wrap(~variable, scales ="free", ncol = 3) +
  labs(x = element_blank(), y = element_blank()) +
  theme_minimal()

Cube root transformations

cbrt_distributions <- (distributions)^(1/3)

cbrt_distributions %>% 
  gather(variable, value, 1:5) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "steelblue", color="steelblue") + 
  facet_wrap(~variable, scales ="free", ncol = 3) +
  labs(x = element_blank(), y = element_blank()) +
  theme_minimal()

Based on the above, all variables are still heavily right skewed with the exception of PhotoAmt which made some improvement, but still not normally distributed.

Exploratory Data Analysis Hide

#copy of data
train_data <- dog_data_merge1

train_data <- train_data %>% 
  select(-c(Breed1, Color1, Color2, Color3))

str(train_data)

## 'data.frame':    8132 obs. of  21 variables:
##  $ Gender        : Factor w/ 3 levels "1","2","3": 2 2 1 2 2 3 1 2 2 1 ...
##  $ MaturitySize  : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 1 2 2 ...
##  $ FurLength     : Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 1 ...
##  $ Vaccinated    : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 2 2 1 1 ...
##  $ Dewormed      : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 1 2 1 1 ...
##  $ Sterilized    : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 2 2 1 3 ...
##  $ Health        : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 1 1 ...
##  $ Quantity      : int  1 1 1 1 2 2 4 1 1 1 ...
##  $ Fee           : int  0 100 0 1 0 0 0 0 0 500 ...
##  $ VideoAmt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PhotoAmt      : int  3 5 5 0 2 5 3 2 2 2 ...
##  $ AdoptionSpeed : Factor w/ 5 levels "0","1","2","3",..: 4 5 5 4 5 5 5 2 5 5 ...
##  $ breedname     : Factor w/ 112 levels "Affenpinscher",..: NA NA NA NA 1 1 2 3 3 4 ...
##  $ AgeYears      : num  0 0 6 0 0 2 0 0 2 1 ...
##  $ NumColors     : Factor w/ 3 levels "1","2","3": 1 2 2 3 3 1 1 2 2 1 ...
##  $ AllMeds       : Factor w/ 7 levels "3","4","5","6",..: 2 2 4 4 4 1 3 4 1 3 ...
##  $ ColorName1    : Factor w/ 7 levels "Black","Brown",..: 2 2 1 2 1 4 2 1 3 2 ...
##  $ ColorName2    : Factor w/ 7 levels "NA","Brown","Golden",..: NA 3 2 5 4 NA NA 2 5 NA ...
##  $ ColorName3    : Factor w/ 6 levels "NA","Golden",..: NA NA NA 6 6 NA NA NA NA NA ...
##  $ size          : Factor w/ 5 levels "giant","large",..: NA NA NA NA 4 4 NA 1 1 NA ...
##  $ hypoallergenic: Factor w/ 2 levels "No","Yes": NA NA NA NA 2 2 1 1 1 1 ...

Below we use wordcloud to visualize the dog breeds in our data. Note the Mixed Breed has been removed as it was skewing our data with the highest number of observations with 5923 total.

set.seed(123)
wordtable <- table(train_data$breedname) %>% 
  as.data.frame() %>% 
  arrange(desc(Freq)) %>% 
  slice(c(-1))

wordcloud(words = wordtable$Var1, freq = wordtable$Freq,
          max.words=250,random.order=FALSE, scale=c(2,.8),
          rot.per=0.1, colors=brewer.pal(8, "Dark2"))

As part of exploring the data, below we look into our response variable and see which level of AdoptionSpeed has the most occurrences. As a reminder from above, below is the legend for our AdoptionSpeed factor levels.

Predictor (Adoption Speed) Description: Predict how quickly, if at all, a pet is adopted.

The values are determined in the following way:

0: Pet was adopted on the same day as it was listed. 1: Pet was adopted between 1 and 7 days (1st week) after being listed. 2: Pet was adopted between 8 and 30 days (1st month) after being listed. 3: Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 4: No adoption after 100 days of being listed. Below we see that most dogs don’t get adopted after 100 days of being listed. The next highest level would be adoption within the first month of listing.

train_data %>% 
  group_by(AdoptionSpeed) %>%
  summarise(Count = length(AdoptionSpeed)) %>% 
  ggplot(aes(x=fct_reorder(factor(AdoptionSpeed),Count), 
             y= Count))+
  geom_col(position="dodge", fill= "steelblue")+
  geom_text(aes(label = Count, hjust = 1))+
  coord_flip()+
  labs(x= "Adoption Speed", y = "Count")+
  theme_minimal()

train_data %>% 
  group_by(PhotoAmt) %>%
  summarise(Count = length(PhotoAmt)) %>% 
  arrange(desc(Count)) %>% 
  top_n(10) %>% 
  ggplot(aes(x=fct_reorder(factor(PhotoAmt),Count), 
             y= Count))+
  geom_col(position="dodge", fill= "steelblue")+
  geom_text(aes(label = Count, hjust = 1))+
  coord_flip()+
  labs(x= "Photos per Listing", y = "Listing Count")+
  theme_minimal()

Below is a visual showing the amount of dogs per observation. The majority of the observation are single dog listings, but there are a number of listings with multiple dogs which in some cases may be a litter or two of dogs under one listing.

train_data %>% 
  group_by(Quantity) %>%
  summarise(Count = length(Quantity)) %>% 
  arrange(desc(Count)) %>% 
  top_n(5)%>% 
  ggplot(aes(x=fct_reorder(factor(Quantity),Count), 
             y= Count))+
  geom_col(position="dodge", fill= "steelblue")+
  geom_text(aes(label = Count, hjust = 1))+
  coord_flip()+
  labs(x= "Pets per Listing", y = "Listing Count")+
  theme_minimal()

numeric_values <- train_data 
numeric_values$AdoptionSpeed <- as.numeric(numeric_values$AdoptionSpeed)

numeric_values<- numeric_values %>% 
  select_if(is.numeric)
 
  


train_cor <- cor(numeric_values)
corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt', upper = "number", lower="circle")

library(Amelia)

map(train_data, ~sum(is.na(.))) %>% t()

##      Gender MaturitySize FurLength Vaccinated Dewormed Sterilized Health
## [1,] 0      0            0         0          0        0          0     
##      Quantity Fee VideoAmt PhotoAmt AdoptionSpeed breedname AgeYears NumColors
## [1,] 0        0   0        0        0             4         0        0        
##      AllMeds ColorName1 ColorName2 ColorName3 size hypoallergenic
## [1,] 0       0          2986       6422       6624 5927

missmap(train_data, main = "Missing vs Observed Values")

Based on the above, all variables are still heavily right skewed with the exception of PhotoAmt which made some improvement, but still not normally distributed.

Exploratory Data Analysis

Creating binary variable for th adoption speed, 1 for adopted and 0 for not adopted

# change all of the values in the data set to 1 or 0
train_data$AdoptionSpeed[train_data$AdoptionSpeed !=0 ] <- 1

As a binary variable now I will conduct a binary regression analysis on it.

Binomial NULL Model

# Build the binomial null regression
model1 <- glm(AdoptionSpeed ~ 1, data = train_data, family = binomial(link ="logit"))
summary(model1)

## 
## Call:
## glm(formula = AdoptionSpeed ~ 1, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7813   0.2056   0.2056   0.2056   0.2056  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.84664    0.07751   49.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1651.5  on 8131  degrees of freedom
## Residual deviance: 1651.5  on 8131  degrees of freedom
## AIC: 1653.5
## 
## Number of Fisher Scoring iterations: 6

From the null regression we can see the AIC is 1653.5

Binomial Model

Binary Logistic Regression Full model utilizing all the variables and data, This model will be considered to be valid..

#  Binomial  Model
model2 <- glm(AdoptionSpeed ~  Gender+MaturitySize +FurLength+ Vaccinated+ Dewormed+ Sterilized + Health + Quantity + Fee + VideoAmt + PhotoAmt + AgeYears + NumColors, data = train_data, family = binomial(link ="logit"))

summary(model2)

## 
## Call:
## glm(formula = AdoptionSpeed ~ Gender + MaturitySize + FurLength + 
##     Vaccinated + Dewormed + Sterilized + Health + Quantity + 
##     Fee + VideoAmt + PhotoAmt + AgeYears + NumColors, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2509   0.1501   0.1822   0.2245   0.5977  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.006e+00  3.249e-01   9.251  < 2e-16 ***
## Gender2        8.954e-02  1.646e-01   0.544  0.58639    
## Gender3        2.863e-01  4.292e-01   0.667  0.50468    
## MaturitySize2  8.638e-01  1.873e-01   4.611 4.01e-06 ***
## MaturitySize3  4.008e-01  2.676e-01   1.498  0.13418    
## MaturitySize4  1.349e+01  5.019e+02   0.027  0.97856    
## FurLength2    -3.285e-01  1.667e-01  -1.971  0.04872 *  
## FurLength3    -6.593e-01  2.891e-01  -2.281  0.02258 *  
## Vaccinated2    5.189e-01  2.792e-01   1.859  0.06306 .  
## Vaccinated3    7.306e-01  4.165e-01   1.754  0.07938 .  
## Dewormed2     -4.421e-01  2.851e-01  -1.550  0.12103    
## Dewormed3     -4.627e-01  4.056e-01  -1.141  0.25399    
## Sterilized2   -2.275e-01  2.195e-01  -1.036  0.30015    
## Sterilized3   -4.677e-01  2.874e-01  -1.627  0.10371    
## Health2       -1.266e-01  3.801e-01  -0.333  0.73899    
## Health3        1.283e+01  5.094e+02   0.025  0.97990    
## Quantity       9.982e-02  1.008e-01   0.990  0.32198    
## Fee           -2.088e-04  6.967e-04  -0.300  0.76442    
## VideoAmt      -3.879e-01  1.744e-01  -2.225  0.02610 *  
## PhotoAmt       8.942e-02  3.378e-02   2.647  0.00811 ** 
## AgeYears       2.580e-02  4.579e-02   0.563  0.57315    
## NumColors2     2.746e-01  1.849e-01   1.485  0.13754    
## NumColors3    -4.105e-01  2.022e-01  -2.030  0.04237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1651.5  on 8131  degrees of freedom
## Residual deviance: 1579.2  on 8109  degrees of freedom
## AIC: 1625.2
## 
## Number of Fisher Scoring iterations: 15

plot(model2)

Checking the resul of the model, we can see the difference between teh different variables in the data set, as the AIC 1625.2

I will select the variables using the step method

The ‘Step’ function in R performs a Step model selection with an objective to minimize the AIC value.

Using Step in both direction

Create multiple models using the STEP function from R, here we check model1 and model2 and run a step in both direcction on these models.

Let’s check an ANOVA table based on the above testing results.

step_b$anova

##             Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                NA        NA      8131   1651.461 1653.461
## 2 + MaturitySize -3 28.476583      8128   1622.984 1630.984
## 3     + PhotoAmt -1  8.622869      8127   1614.361 1624.361
## 4    + NumColors -2  9.334944      8125   1605.026 1619.026
## 5    + FurLength -2  8.468409      8123   1596.558 1614.558
## 6     + VideoAmt -1  3.823620      8122   1592.734 1612.734
## 7     + Quantity -1  3.391945      8121   1589.342 1611.342

From the above results, it shows that the best model is as follows:

summary(step_b)

## 
## Call:
## glm(formula = AdoptionSpeed ~ MaturitySize + PhotoAmt + NumColors + 
##     FurLength + VideoAmt + Quantity, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2917   0.1559   0.1864   0.2241   0.5909  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.92281    0.22695  12.878  < 2e-16 ***
## MaturitySize2   0.87295    0.18160   4.807 1.53e-06 ***
## MaturitySize3   0.40206    0.26610   1.511  0.13081    
## MaturitySize4  12.41566  306.10603   0.041  0.96765    
## PhotoAmt        0.09674    0.03354   2.884  0.00392 ** 
## NumColors2      0.27588    0.18456   1.495  0.13496    
## NumColors3     -0.41276    0.20146  -2.049  0.04048 *  
## FurLength2     -0.34512    0.16512  -2.090  0.03660 *  
## FurLength3     -0.65913    0.27623  -2.386  0.01703 *  
## VideoAmt       -0.39208    0.17259  -2.272  0.02311 *  
## Quantity        0.12471    0.07584   1.644  0.10008    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1651.5  on 8131  degrees of freedom
## Residual deviance: 1589.3  on 8121  degrees of freedom
## AIC: 1611.3
## 
## Number of Fisher Scoring iterations: 14

We see how all the predictors are statistical significant, also, noticing how the variable are as statistical significant and the compared values.

The below plot shows our fitted models vs Density.

hist(step_b$fitted.values, main = " Histogram ",xlab = "Fitted models", col = 'skyblue3')

Show the predicted values

train_data$Predict <- ifelse(step_b$fitted.values >0.5,"pos","neg")
head(train_data$Predict)

## [1] "pos" "pos" "pos" "pos" "pos" "pos"

Here we can see the model works really well on the train data, we need to use it on the evaluation data and check the accuracy of the model on it.

Using the model to predict the evaluation data

Pulling the evaluation data from Github

evaluation <- read.csv("https://raw.githubusercontent.com/akarimhammoud/Data_621/main/Final%20Project/data/TestData/test_cleaned.csv")

Transform the variables in the evaluation data.

evaluation <- transform(
  evaluation,
  Type=as.factor(Type),
  Breed1=as.factor(Breed1),
  Gender=as.factor(Gender),
  Color1=as.factor(Color1),
  Color2=as.factor(Color2),
  Color3=as.factor(Color3),
  MaturitySize=as.factor(MaturitySize),
  FurLength=as.factor(FurLength),
  Vaccinated = as.factor(Vaccinated),
  Dewormed = as.factor(Dewormed),
  Sterilized = as.factor(Sterilized),
  Health = as.factor(Health),
  NumColors = as.factor(NumColors)
)

Here lets transform the evaluation data to fit the model where we used the train data.

#transformed variables
evaluation <- evaluation %>% 
  dplyr::select(Gender,MaturitySize ,FurLength, Vaccinated, Dewormed, Sterilized , Health , Quantity , Fee , VideoAmt , PhotoAmt , AgeYears , NumColors)

#predictions of the model witht the cleaned evaluation data

#predictions
pred <- predict(model2, evaluation, type = "response")
summary(pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.6220  0.9710  0.9812  0.9770  0.9876  1.0000       1

here we can see the median 0.9812 and the mean 0.9770 based on the evaluation data, this is a really good prediction close the 100, and with running the test we can see the real value of this model.

# The predictions variable
pred_df <- as.data.frame(pred)

#Round the values to 4
pred_df$values <- round((exp(pred_df$pred)),4)

head(pred_df$values)

## [1] 2.6551 2.6850 2.6772 2.6545 2.6539 2.6519

The values here shows the density of the adoption rate is more around 2.65

Checking the predictive values for the evaluation data.

# Creating Bar chart for the predictive values.
ggplot(pred_df, aes(x=values)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="steelblue")+
labs(title="Predicted Value",x="Adoption", y = "Density")

We can see that the adoption values is more skewed to the right, the density of the adoption rate is more around 2.5 to 2.75

DATA621: Final Project

Group 2: George Cruz Deschamps, Karim Hammoud, Maliat Islam, Gabriella Maritnez, Ken Popkin

Date: 2021-12-04 Due:2021-12-12

Overview

Learn more about the data

What to Predict

This Notebook…

Load Data

Missing Values Count

Show the type of each data

New features

Remove features

Show the updated data type

Add Labels for Colors

Dog Breed Data woof

Dog (data) Wrangling

Size Label by Breed

Add feature

Hypoallergenic

Add feature

Distributions of Numeric Variables

Creating binary variable for th adoption speed, 1 for adopted and 0 for not adopted

Binomial NULL Model

Binomial Model

Using the model to predict the evaluation data