There are millions of stray pets around the world, some of which are fortunate enough to be adopted while many others are not. While adoption of a pet is often the definition of success, the rate at which a pet is adopted is also a key success factor - pets that take a long time to adopt contribute to over-crowded animal shelters and can prevent taking on new strays. Sadly, pets that are not adopted eventually need to be euthanized.
Predictor (Adoption Speed) Description: Predict how quickly, if at all, a pet is adopted.
The values are determined in the following way: 0 - Pet was adopted on the same day as it was listed. 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 4 - No adoption after 100 days of being listed.
The data has no missing values, but there are a number of features in text that need to be converted to some numeric value. This notebook performs those changes.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)library(dplyr)
library(reshape)
library(ggplot2)
library(purrr)
library(psych)
library(tidyr)
library(corrplot)
library(forcats)
library(kableExtra)
library(stats)
library(GGally)
library(wordcloud)
library(caret )Load the data
## Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3
## 1 1 Lil Milo 2 0 26 2 2 0 0
## 2 1 Bella 4 Months Puppy! 4 0 307 2 2 3 0
## MaturitySize FurLength Vaccinated Dewormed Sterilized Health Quantity Fee
## 1 2 1 1 1 2 1 1 0
## 2 2 1 1 1 2 1 1 100
## State RescuerID VideoAmt
## 1 41326 1a2113010d6048d5410b265347b35c91 0
## 2 41326 3673e167fc9932b13149bed1f2a0180a 0
## Description
## 1 Milo went missing after a week with her new adoptive family. Only 3 months old, light brown coat. Missing from Jalan Kiara, Bandar Botanic, Klang. Please call Su at if you've seen her.
## 2 She's only 4 months old, very friendly and loving. Loves attention. A little naughty sometimes. But she's adorable. I adopted her from MDDB, but recently I have just moved to a condo. Im finding a perfect and loving home for her.
## PetID PhotoAmt AdoptionSpeed breedname
## 1 375905770 3 3 <NA>
## 2 da8d4a273 5 4 <NA>
There are no missing values
map(data, ~sum(is.na(.))) %>% t()## Type Name Age Breed1 Breed2 Gender Color1 Color2 Color3 MaturitySize
## [1,] 0 0 0 0 0 0 0 0 0 0
## FurLength Vaccinated Dewormed Sterilized Health Quantity Fee State
## [1,] 0 0 0 0 0 0 0 0
## RescuerID VideoAmt Description PetID PhotoAmt AdoptionSpeed breedname
## [1,] 0 0 0 0 0 0 5
str(data)## 'data.frame': 14993 obs. of 25 variables:
## $ Type : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Name : chr "Lil Milo" "Bella 4 Months Puppy!" "" "\"Boy Boy\"" ...
## $ Age : int 2 4 3 72 2 5 24 3 0 24 ...
## $ Breed1 : int 0 0 0 0 0 1 1 3 5 5 ...
## $ Breed2 : int 26 307 266 307 205 0 0 0 0 307 ...
## $ Gender : int 2 2 3 1 2 2 3 1 2 2 ...
## $ Color1 : int 2 2 1 1 2 1 4 2 1 3 ...
## $ Color2 : int 0 3 4 2 5 4 0 0 2 5 ...
## $ Color3 : int 0 0 7 0 7 7 0 0 0 0 ...
## $ MaturitySize : int 2 2 1 2 1 2 2 2 1 2 ...
## $ FurLength : int 1 1 1 2 1 1 1 2 1 2 ...
## $ Vaccinated : int 1 1 2 2 2 2 1 2 2 1 ...
## $ Dewormed : int 1 1 1 2 2 2 1 1 2 1 ...
## $ Sterilized : int 2 2 2 2 2 2 1 2 2 1 ...
## $ Health : int 1 1 1 1 1 1 1 1 2 1 ...
## $ Quantity : int 1 1 3 1 1 2 2 4 1 1 ...
## $ Fee : int 0 100 0 0 1 0 0 0 0 0 ...
## $ State : int 41326 41326 41401 41326 41336 41326 41330 41326 41401 41326 ...
## $ RescuerID : chr "1a2113010d6048d5410b265347b35c91" "3673e167fc9932b13149bed1f2a0180a" "f7cff59d10c867bdee12c3f35f34d086" "94b991f8dc1e0bb903ca8d4d492c8d43" ...
## $ VideoAmt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Description : chr "Milo went missing after a week with her new adoptive family. Only 3 months old, light brown coat. Missing from "| __truncated__ "She's only 4 months old, very friendly and loving. Loves attention. A little naughty sometimes. But she's adora"| __truncated__ "Mama cat came to house and gave birth to these 03 lovely kittens, please adopt them and give them a home sweet home." "He is a stray dog found wandering around University Putra Malaysia (UPM), Serdang main campus. I have been told"| __truncated__ ...
## $ PetID : chr "375905770" "da8d4a273" "27e74e45c" "7b5bee232" ...
## $ PhotoAmt : int 3 5 11 5 0 2 5 3 2 2 ...
## $ AdoptionSpeed: int 3 4 2 4 3 4 4 4 1 4 ...
## $ breedname : chr NA NA NA NA ...
AgeYears - provided in months this is too many variations. Decided to convert to new feature of AgeYears and drop Age
data$AgeYears = round(data$Age/12, 0)NumColors - derived by counting the number of colors a pet has using the Color1,2, and 3 features - keep these features, but given there are 7 color categories the value of this feature is questionable in doubt - create a NumColors feature with a 1,2, or 3 respectively
num_colors_vec = c()
for (i in 1:nrow(data)){
ncolors = 1
color2 <- data[i,8]
color3 <- data[i,9]
if (color2 > 0){ncolors = ncolors + 1}
if (color3 > 0){ncolors = ncolors + 1}
num_colors_vec <- c(num_colors_vec, ncolors)
}
data$NumColors = num_colors_vecAllMeds = the sum of Vaccinated, Dewormed, and Sterilized. Combining these into one feature could potentially reduce multi-collinearity
data$AllMeds = data$Vaccinated + data$Dewormed + data$SterilizedRemove the following features… Name - this is a text field and new owners can (and usually do) rename their pets, so removing this feature Age - replaced with AgeYears (see above) Breed2 - 10,700 of almost 15,000 rows are populated with 0 (unknown), so this doesn’t seem like a good feature to keep State - initially kept this, but the correlation to AdoptionSpeed is only about 2%. Decided to remove it RescuerID - common sense is that this field will not have any predictive value for adoption rate Description - this will have value in future analysis for NLP, but this will be evaluated differently in another notebook PetID - same reason as RescuerID
data = subset(data, select = -c(Name, Age, Breed2, State, RescuerID, Description, PetID))sapply(data,class)## Type Breed1 Gender Color1 Color2
## "integer" "integer" "integer" "integer" "integer"
## Color3 MaturitySize FurLength Vaccinated Dewormed
## "integer" "integer" "integer" "integer" "integer"
## Sterilized Health Quantity Fee VideoAmt
## "integer" "integer" "integer" "integer" "integer"
## PhotoAmt AdoptionSpeed breedname AgeYears NumColors
## "integer" "integer" "character" "numeric" "numeric"
## AllMeds
## "integer"
#Change categorical variables from integer to factors
data <- transform(
data,
Type=as.factor(Type),
Breed1=as.factor(Breed1),
Gender=as.factor(Gender),
Color1=as.factor(Color1),
Color2=as.factor(Color2),
Color3=as.factor(Color3),
MaturitySize=as.factor(MaturitySize),
FurLength=as.factor(FurLength),
Vaccinated = as.factor(Vaccinated),
Dewormed = as.factor(Dewormed),
Sterilized = as.factor(Sterilized),
Health = as.factor(Health),
AdoptionSpeed = as.factor(AdoptionSpeed),
NumColors = as.factor(NumColors)
)str(data)## 'data.frame': 14993 obs. of 21 variables:
## $ Type : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 1 1 1 ...
## $ Breed1 : Factor w/ 176 levels "0","1","3","5",..: 1 1 1 1 1 2 2 3 4 4 ...
## $ Gender : Factor w/ 3 levels "1","2","3": 2 2 3 1 2 2 3 1 2 2 ...
## $ Color1 : Factor w/ 7 levels "1","2","3","4",..: 2 2 1 1 2 1 4 2 1 3 ...
## $ Color2 : Factor w/ 7 levels "0","2","3","4",..: 1 3 4 2 5 4 1 1 2 5 ...
## $ Color3 : Factor w/ 6 levels "0","3","4","5",..: 1 1 6 1 6 6 1 1 1 1 ...
## $ MaturitySize : Factor w/ 4 levels "1","2","3","4": 2 2 1 2 1 2 2 2 1 2 ...
## $ FurLength : Factor w/ 3 levels "1","2","3": 1 1 1 2 1 1 1 2 1 2 ...
## $ Vaccinated : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 2 1 2 2 1 ...
## $ Dewormed : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 1 1 2 1 ...
## $ Sterilized : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 1 2 2 1 ...
## $ Health : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 1 ...
## $ Quantity : int 1 1 3 1 1 2 2 4 1 1 ...
## $ Fee : int 0 100 0 0 1 0 0 0 0 0 ...
## $ VideoAmt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PhotoAmt : int 3 5 11 5 0 2 5 3 2 2 ...
## $ AdoptionSpeed: Factor w/ 5 levels "0","1","2","3",..: 4 5 3 5 4 5 5 5 2 5 ...
## $ breedname : chr NA NA NA NA ...
## $ AgeYears : num 0 0 0 6 0 0 2 0 0 2 ...
## $ NumColors : Factor w/ 3 levels "1","2","3": 1 2 3 2 3 3 1 1 2 2 ...
## $ AllMeds : int 4 4 5 6 6 6 3 5 6 3 ...
Import the color labels to see the names of the colors.
colors <- read.csv('./data/color_labels.csv')
glimpse(colors)## Rows: 7
## Columns: 2
## $ ColorID <int> 1, 2, 3, 4, 5, 6, 7
## $ ColorName <chr> "Black", "Brown", "Golden", "Yellow", "Cream", "Gray", "Whit…
Checking the levels of Color1, Color2, and Color3
levels(data$Color1)## [1] "1" "2" "3" "4" "5" "6" "7"
levels(data$Color2)## [1] "0" "2" "3" "4" "5" "6" "7"
levels(data$Color3)## [1] "0" "3" "4" "5" "6" "7"
Creating new variables to name the colors.
data$ColorName1 <- recode_factor(data$Color1,"1"= "Black",
"2"= "Brown",
"3"= "Golden",
"4"= "Yellow",
"5"= "Cream",
"6"= "Gray",
"7"= "White")data$ColorName2 <- recode_factor(data$Color2, "0" = "NA",
"2"= "Brown",
"3"= "Golden",
"4"= "Yellow",
"5"= "Cream",
"6"= "Gray",
"7"= "White")data$ColorName3 <- recode_factor(data$Color3, "0" = "NA",
"3"= "Golden",
"4"= "Yellow",
"5"= "Cream",
"6"= "Gray",
"7"= "White")Converting the char version of “NA” introduced in ColorNames 1-3 to type NA.
data[data=="NA"]<-NALets check how many dogs and cats we have:
table(data$Type)##
## 1 2
## 8132 6861
It looks like we have 1262 more dogs than cats, with 8132 dogs and 6861 cats.
Subset data by Type to focus on dogs.
dog_data = filter(data, Type == 1)
dog_data = subset(dog_data, select = -c(Type))dog_data$breedname <- gsub("Amer ", "American ", dog_data$breedname) Next we will collapse duplicate Labrador Retriever factors (Black, Chocolate, and Yellow) into one as they are not considered to be separate Retriever breeds.
dog_data$breedname <- fct_collapse(dog_data$breedname,
'Labrador Retriever' =
c("Black Labrador Retriever",
"Chocolate Labrador Retriever",
"Yellow Labrador Retriever"))Below are the top 10 breeds where Mixed Breed has a heavy majority with more than half of the dogs being categorized as a mix.
table(dog_data$breedname) %>%
as.data.frame() %>%
arrange(desc(Freq)) %>%
top_n(10)## Var1 Freq
## 1 Mixed Breed 5923
## 2 Labrador Retriever 226
## 3 Shih Tzu 189
## 4 Poodle 167
## 5 Terrier 161
## 6 Golden Retriever 151
## 7 German Shepherd Dog 98
## 8 Beagle 90
## 9 Spitz 89
## 10 Rottweiler 88
#subset(Freq!=0)When deciding to adopt a new pet a consideration many people have is the size of the dog for a number of reasons. Size plays an important role in a dog’s strength and activity level which is a concern for a potential new owner, in addition to space in ones house or if they have a backyard or not.
Next we bring in the AKC Breed Info table to load details on dog sizes by breed.
breed_size <- read.csv('./data/TrainingData/AKC Breed Info.csv')
str(breed_size)## 'data.frame': 150 obs. of 5 variables:
## $ BreedName : chr "Akita" "Anatolian Sheepdog" "Bernese Mountain Dog" "Bloodhound" ...
## $ height_low_inches : chr "26" "27" "23" "24" ...
## $ height_high_inches: chr "28" "29" "27" "26" ...
## $ weight_low_lbs : int 80 100 85 80 70 100 120 95 130 90 ...
## $ weight_high_lbs : int 120 150 110 120 100 130 160 120 150 150 ...
Remove one Fox Terrier of the two entries. Both entries have the same size data, only difference is hair type (smooth vs. wire).
breed_size <- breed_size %>%
slice(-116)breed_size$BreedName <- recode(breed_size$BreedName,
"Collie (Rough) & (Smooth)" = "Collie",
"Fox Terrier ‰ÛÒ Wirehair" = "Fox Terrier")Using the breed_size table, a feature can be created using the American Kennel Club Breeds by Size breakdown.
Alternatively, here is another breed type breakdown by size.
breed_size$avg_weight <- rowMeans(breed_size[,c('weight_low_lbs',
'weight_high_lbs')], na.rm=TRUE)Since there are two types of Cocker Spaniels (American and English) in our breed_size table, and our dog_data has Cocker Spaniel, I took the average of both the American and English to give an idea of what a general Cocker Spaniel might weigh to then be able to classify its size.
#[row,column]
breed_size$height_low_inches <- as.integer(breed_size$height_low_inches)
breed_size$height_high_inches <- as.integer(breed_size$height_high_inches)
#cocker spaniel avg
cs <- sapply(breed_size[c(68,69),],mean)
breed_size<-rbind(breed_size, cs)
breed_size[150,1] = "Cocker Spaniel"
breed_size[68,1] ="American Cocker Spaniel"
breed_size[69,1] ="English Cocker Spaniel"
#poodle avg
poodle <- sapply(breed_size[c(122,48,147),],mean)
breed_size<-rbind(breed_size, poodle)
breed_size[151,1] = "Poodle"Create a size feature based on breed using case_when()1.
breed_size<- breed_size %>%
mutate(
size = case_when(
(avg_weight >= 3 & avg_weight < 9) ~ "toy",
(avg_weight >= 9 & avg_weight < 35) ~ "small",
(avg_weight >= 35 & avg_weight < 65) ~ "medium",
(avg_weight >= 65 & avg_weight < 85) ~ "large",
(avg_weight >= 85) ~ "giant"
)
)Now we will left_join() 2 our dog_data with our breed_size to bring in the size feature by breed.
dog_data_merge <- left_join(dog_data, select(breed_size, c(1,7)),
by=c("breedname" = "BreedName"))
dog_data_merge$size <- as.factor(dog_data_merge$size)
table(dog_data_merge$size) %>%
as.data.frame() %>%
arrange(desc(Freq))## Var1 Freq
## 1 small 608
## 2 large 557
## 3 medium 140
## 4 giant 119
## 5 toy 84
What is a hypoallergenic dog?
According to the American College of Allergy, Asthma, and Immunology, as much as 10% of the U.S. population is allergic to dogs. While there are no 100% hypoallergenic dogs, there are many breeds that do well with allergy sufferers. Dander, which is attached to pet hair, is what causes most pet allergies in humans and these dogs have a non-shedding coat that produces less dander.
So if you are either part of the 10% of the U.s. population that is allergic to dogs or just don’t like to have to vacuum dog hair twice a day then a hypoallergenic dog will be more appealing to you than a non-hypoallergenic dog.
Below we create a dataframe using the list of hypoallergenic dogs from the American Kennel Club. Once created, we left_join() it to our dog_data_merge dataframe to add the new feature for analysis.
breed <- c("Affenpinscher", "Afgan Hound", "American Hairless Terrier",
"Barbado da Terceira", "Bedlington Terrier", "Bichon Frise",
"Bolongese", "Chinese Crested", "Coton de Tulear",
"Giant Schnauzer", "Irish Water Spaniel",
"Kerry Blue Terrier","Lagotto Romangnolo", "Lowchen",
"Maltese","Miniature Shnauzer","Peruvian Inca Orchid",
"Poodle", "Standard Poodle", "Schnauzer",
"Portuguese Water Dog", "Russian Tsvetnaya Bolonka",
"Wheaten Terrier","Xoloitzcuintli", "Yorkshire Terrier"
)
hypoallergenic <- as.data.frame(breed)
hypoallergenic$hypoallergenic <- "Yes"dog_data_merge1 <- left_join(dog_data_merge, hypoallergenic,
by=c("breedname" = "breed"))
#when breedname = Mixed Breed & NA hypo = NA, all else should be "No"
dog_data_merge1$hypoallergenic[is.na(dog_data_merge1$hypoallergenic)] <- "No"
dog_data_merge1$hypoallergenic[which(dog_data_merge1$breedname == "Mixed Breed")] <- NA
dog_data_merge1$hypoallergenic[which(is.na(dog_data_merge1$breedname))] <- NA
dog_data_merge1$hypoallergenic <- as.factor(dog_data_merge1$hypoallergenic)How many hypoallergenic and non-hypoallergenic do we have?
table(dog_data_merge1$hypoallergenic)##
## No Yes
## 1942 263
sum(table(dog_data_merge1$hypoallergenic))## [1] 2205
Mixed Breeds and breeds that are NA in the breedname column will be NA for hypoallergenic as it logically can’t be confirmed if they are hypoallergenic or not. In general, there are Mixed Breeds that can be considered hypoallergenic. For example, a dog that is part Poodle (hypoallergenic) and part cavalier King Charles Spaniel (non-hypoallergenic) can be considered hypoallergenic, the resulting pup would is often called Cavapoo or Cavoodle.
How many NA breed do we have?
table(is.na(dog_data_merge1$breedname))##
## FALSE TRUE
## 8128 4
How many are Mixed Breed?
table(dog_data_merge1$breedname=="Mixed Breed")##
## FALSE TRUE
## 2205 5923
Here we check our data with the added features of size and hypoallergenic.
str(dog_data_merge1)## 'data.frame': 8132 obs. of 25 variables:
## $ Breed1 : Factor w/ 176 levels "0","1","3","5",..: 1 1 1 1 2 2 3 4 4 5 ...
## $ Gender : Factor w/ 3 levels "1","2","3": 2 2 1 2 2 3 1 2 2 1 ...
## $ Color1 : Factor w/ 7 levels "1","2","3","4",..: 2 2 1 2 1 4 2 1 3 2 ...
## $ Color2 : Factor w/ 7 levels "0","2","3","4",..: 1 3 2 5 4 1 1 2 5 1 ...
## $ Color3 : Factor w/ 6 levels "0","3","4","5",..: 1 1 1 6 6 1 1 1 1 1 ...
## $ MaturitySize : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 1 2 2 ...
## $ FurLength : Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 1 ...
## $ Vaccinated : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 2 2 1 1 ...
## $ Dewormed : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 1 2 1 1 ...
## $ Sterilized : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 2 2 1 3 ...
## $ Health : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 1 1 ...
## $ Quantity : int 1 1 1 1 2 2 4 1 1 1 ...
## $ Fee : int 0 100 0 1 0 0 0 0 0 500 ...
## $ VideoAmt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PhotoAmt : int 3 5 5 0 2 5 3 2 2 2 ...
## $ AdoptionSpeed : Factor w/ 5 levels "0","1","2","3",..: 4 5 5 4 5 5 5 2 5 5 ...
## $ breedname : chr NA NA NA NA ...
## $ AgeYears : num 0 0 6 0 0 2 0 0 2 1 ...
## $ NumColors : Factor w/ 3 levels "1","2","3": 1 2 2 3 3 1 1 2 2 1 ...
## $ AllMeds : int 4 4 6 6 6 3 5 6 3 5 ...
## $ ColorName1 : Factor w/ 7 levels "Black","Brown",..: 2 2 1 2 1 4 2 1 3 2 ...
## $ ColorName2 : Factor w/ 7 levels "NA","Brown","Golden",..: NA 3 2 5 4 NA NA 2 5 NA ...
## $ ColorName3 : Factor w/ 6 levels "NA","Golden",..: NA NA NA 6 6 NA NA NA NA NA ...
## $ size : Factor w/ 5 levels "giant","large",..: NA NA NA NA 4 4 NA 1 1 NA ...
## $ hypoallergenic: Factor w/ 2 levels "No","Yes": NA NA NA NA 2 2 1 1 1 1 ...
Converting the breedname and AllMeds variables to factor.
dog_data_merge1$breedname <- as.factor(dog_data_merge1$breedname)
dog_data_merge1$AllMeds <- as.factor(dog_data_merge1$AllMeds)Here we are subsetting our numeric predictor variables to check their distributions.
#subset our 3 vars
distributions <- dog_data_merge1 %>%
keep(is.numeric)Next we check if the variables contain any NA values.
colSums(is.na(distributions))## Quantity Fee VideoAmt PhotoAmt AgeYears
## 0 0 0 0 0
#plot
distributions %>%
gather(variable, value, 1:5) %>%
ggplot(aes(value)) +
facet_wrap(~variable, scales = "free") +
geom_density(fill = "steelblue", alpha=0.9, color="steelblue") +
geom_histogram(aes(y=..density..), alpha=0.2, fill = "lightblue", color="lightblue", position="identity") +
theme_minimal()#box plots
distributions %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(fill = "steelblue", color="black", outlier.colour="red", outlier.shape=16,
outlier.size=2, notch=FALSE) +
theme_minimal()head(dog_data_merge1)## Breed1 Gender Color1 Color2 Color3 MaturitySize FurLength Vaccinated Dewormed
## 1 0 2 2 0 0 2 1 1 1
## 2 0 2 2 3 0 2 1 1 1
## 3 0 1 1 2 0 2 2 2 2
## 4 0 2 2 5 7 1 1 2 2
## 5 1 2 1 4 7 2 1 2 2
## 6 1 3 4 0 0 2 1 1 1
## Sterilized Health Quantity Fee VideoAmt PhotoAmt AdoptionSpeed breedname
## 1 2 1 1 0 0 3 3 <NA>
## 2 2 1 1 100 0 5 4 <NA>
## 3 2 1 1 0 0 5 4 <NA>
## 4 2 1 1 1 0 0 3 <NA>
## 5 2 1 2 0 0 2 4 Affenpinscher
## 6 1 1 2 0 0 5 4 Affenpinscher
## AgeYears NumColors AllMeds ColorName1 ColorName2 ColorName3 size
## 1 0 1 4 Brown <NA> <NA> <NA>
## 2 0 2 4 Brown Golden <NA> <NA>
## 3 6 2 6 Black Brown <NA> <NA>
## 4 0 3 6 Brown Cream White <NA>
## 5 0 3 6 Black Yellow White small
## 6 2 1 3 Yellow <NA> <NA> small
## hypoallergenic
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 Yes
## 6 Yes
Transformations of numeric variables Next we will take transformations of the variables reviewed above to see if the transformed variables are worth looking into and using for our model.
Log Transformations
Hide
log_distributions <- log(distributions + 1)
# Histograms of log transformed numeric variables
log_distributions %>%
gather(variable, value, 1:5) %>%
ggplot(., aes(value)) +
geom_density(fill = "steelblue", color="steelblue") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank()) +
theme_minimal()Square Root transformations
sqrt_distributions <- sqrt(distributions)
sqrt_distributions %>%
gather(variable, value, 1:5) %>%
ggplot(., aes(value)) +
geom_density(fill = "steelblue", color="steelblue") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank()) +
theme_minimal()Cube root transformations
cbrt_distributions <- (distributions)^(1/3)
cbrt_distributions %>%
gather(variable, value, 1:5) %>%
ggplot(., aes(value)) +
geom_density(fill = "steelblue", color="steelblue") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank()) +
theme_minimal()Based on the above, all variables are still heavily right skewed with the exception of PhotoAmt which made some improvement, but still not normally distributed.
Exploratory Data Analysis Hide
#copy of data
train_data <- dog_data_merge1train_data <- train_data %>%
select(-c(Breed1, Color1, Color2, Color3))
str(train_data)## 'data.frame': 8132 obs. of 21 variables:
## $ Gender : Factor w/ 3 levels "1","2","3": 2 2 1 2 2 3 1 2 2 1 ...
## $ MaturitySize : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 1 2 2 ...
## $ FurLength : Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 1 ...
## $ Vaccinated : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 2 2 1 1 ...
## $ Dewormed : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 1 1 2 1 1 ...
## $ Sterilized : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 2 2 1 3 ...
## $ Health : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 1 1 ...
## $ Quantity : int 1 1 1 1 2 2 4 1 1 1 ...
## $ Fee : int 0 100 0 1 0 0 0 0 0 500 ...
## $ VideoAmt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PhotoAmt : int 3 5 5 0 2 5 3 2 2 2 ...
## $ AdoptionSpeed : Factor w/ 5 levels "0","1","2","3",..: 4 5 5 4 5 5 5 2 5 5 ...
## $ breedname : Factor w/ 112 levels "Affenpinscher",..: NA NA NA NA 1 1 2 3 3 4 ...
## $ AgeYears : num 0 0 6 0 0 2 0 0 2 1 ...
## $ NumColors : Factor w/ 3 levels "1","2","3": 1 2 2 3 3 1 1 2 2 1 ...
## $ AllMeds : Factor w/ 7 levels "3","4","5","6",..: 2 2 4 4 4 1 3 4 1 3 ...
## $ ColorName1 : Factor w/ 7 levels "Black","Brown",..: 2 2 1 2 1 4 2 1 3 2 ...
## $ ColorName2 : Factor w/ 7 levels "NA","Brown","Golden",..: NA 3 2 5 4 NA NA 2 5 NA ...
## $ ColorName3 : Factor w/ 6 levels "NA","Golden",..: NA NA NA 6 6 NA NA NA NA NA ...
## $ size : Factor w/ 5 levels "giant","large",..: NA NA NA NA 4 4 NA 1 1 NA ...
## $ hypoallergenic: Factor w/ 2 levels "No","Yes": NA NA NA NA 2 2 1 1 1 1 ...
Below we use wordcloud to visualize the dog breeds in our data. Note the Mixed Breed has been removed as it was skewing our data with the highest number of observations with 5923 total.
set.seed(123)
wordtable <- table(train_data$breedname) %>%
as.data.frame() %>%
arrange(desc(Freq)) %>%
slice(c(-1))
wordcloud(words = wordtable$Var1, freq = wordtable$Freq,
max.words=250,random.order=FALSE, scale=c(2,.8),
rot.per=0.1, colors=brewer.pal(8, "Dark2"))As part of exploring the data, below we look into our response variable and see which level of AdoptionSpeed has the most occurrences. As a reminder from above, below is the legend for our AdoptionSpeed factor levels.
Predictor (Adoption Speed) Description: Predict how quickly, if at all, a pet is adopted.
The values are determined in the following way:
0: Pet was adopted on the same day as it was listed. 1: Pet was adopted between 1 and 7 days (1st week) after being listed. 2: Pet was adopted between 8 and 30 days (1st month) after being listed. 3: Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 4: No adoption after 100 days of being listed. Below we see that most dogs don’t get adopted after 100 days of being listed. The next highest level would be adoption within the first month of listing.
train_data %>%
group_by(AdoptionSpeed) %>%
summarise(Count = length(AdoptionSpeed)) %>%
ggplot(aes(x=fct_reorder(factor(AdoptionSpeed),Count),
y= Count))+
geom_col(position="dodge", fill= "steelblue")+
geom_text(aes(label = Count, hjust = 1))+
coord_flip()+
labs(x= "Adoption Speed", y = "Count")+
theme_minimal()train_data %>%
group_by(PhotoAmt) %>%
summarise(Count = length(PhotoAmt)) %>%
arrange(desc(Count)) %>%
top_n(10) %>%
ggplot(aes(x=fct_reorder(factor(PhotoAmt),Count),
y= Count))+
geom_col(position="dodge", fill= "steelblue")+
geom_text(aes(label = Count, hjust = 1))+
coord_flip()+
labs(x= "Photos per Listing", y = "Listing Count")+
theme_minimal()Below is a visual showing the amount of dogs per observation. The majority of the observation are single dog listings, but there are a number of listings with multiple dogs which in some cases may be a litter or two of dogs under one listing.
train_data %>%
group_by(Quantity) %>%
summarise(Count = length(Quantity)) %>%
arrange(desc(Count)) %>%
top_n(5)%>%
ggplot(aes(x=fct_reorder(factor(Quantity),Count),
y= Count))+
geom_col(position="dodge", fill= "steelblue")+
geom_text(aes(label = Count, hjust = 1))+
coord_flip()+
labs(x= "Pets per Listing", y = "Listing Count")+
theme_minimal()numeric_values <- train_data
numeric_values$AdoptionSpeed <- as.numeric(numeric_values$AdoptionSpeed)
numeric_values<- numeric_values %>%
select_if(is.numeric)
train_cor <- cor(numeric_values)
corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt', upper = "number", lower="circle")library(Amelia)
map(train_data, ~sum(is.na(.))) %>% t()## Gender MaturitySize FurLength Vaccinated Dewormed Sterilized Health
## [1,] 0 0 0 0 0 0 0
## Quantity Fee VideoAmt PhotoAmt AdoptionSpeed breedname AgeYears NumColors
## [1,] 0 0 0 0 0 4 0 0
## AllMeds ColorName1 ColorName2 ColorName3 size hypoallergenic
## [1,] 0 0 2986 6422 6624 5927
missmap(train_data, main = "Missing vs Observed Values")Based on the above, all variables are still heavily right skewed with the exception of PhotoAmt which made some improvement, but still not normally distributed.
Exploratory Data Analysis
# change all of the values in the data set to 1 or 0
train_data$AdoptionSpeed[train_data$AdoptionSpeed !=0 ] <- 1As a binary variable now I will conduct a binary regression analysis on it.
# Build the binomial null regression
model1 <- glm(AdoptionSpeed ~ 1, data = train_data, family = binomial(link ="logit"))
summary(model1)##
## Call:
## glm(formula = AdoptionSpeed ~ 1, family = binomial(link = "logit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7813 0.2056 0.2056 0.2056 0.2056
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.84664 0.07751 49.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1651.5 on 8131 degrees of freedom
## Residual deviance: 1651.5 on 8131 degrees of freedom
## AIC: 1653.5
##
## Number of Fisher Scoring iterations: 6
From the null regression we can see the AIC is 1653.5
Binary Logistic Regression Full model utilizing all the variables and data, This model will be considered to be valid..
# Binomial Model
model2 <- glm(AdoptionSpeed ~ Gender+MaturitySize +FurLength+ Vaccinated+ Dewormed+ Sterilized + Health + Quantity + Fee + VideoAmt + PhotoAmt + AgeYears + NumColors, data = train_data, family = binomial(link ="logit"))
summary(model2)##
## Call:
## glm(formula = AdoptionSpeed ~ Gender + MaturitySize + FurLength +
## Vaccinated + Dewormed + Sterilized + Health + Quantity +
## Fee + VideoAmt + PhotoAmt + AgeYears + NumColors, family = binomial(link = "logit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2509 0.1501 0.1822 0.2245 0.5977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.006e+00 3.249e-01 9.251 < 2e-16 ***
## Gender2 8.954e-02 1.646e-01 0.544 0.58639
## Gender3 2.863e-01 4.292e-01 0.667 0.50468
## MaturitySize2 8.638e-01 1.873e-01 4.611 4.01e-06 ***
## MaturitySize3 4.008e-01 2.676e-01 1.498 0.13418
## MaturitySize4 1.349e+01 5.019e+02 0.027 0.97856
## FurLength2 -3.285e-01 1.667e-01 -1.971 0.04872 *
## FurLength3 -6.593e-01 2.891e-01 -2.281 0.02258 *
## Vaccinated2 5.189e-01 2.792e-01 1.859 0.06306 .
## Vaccinated3 7.306e-01 4.165e-01 1.754 0.07938 .
## Dewormed2 -4.421e-01 2.851e-01 -1.550 0.12103
## Dewormed3 -4.627e-01 4.056e-01 -1.141 0.25399
## Sterilized2 -2.275e-01 2.195e-01 -1.036 0.30015
## Sterilized3 -4.677e-01 2.874e-01 -1.627 0.10371
## Health2 -1.266e-01 3.801e-01 -0.333 0.73899
## Health3 1.283e+01 5.094e+02 0.025 0.97990
## Quantity 9.982e-02 1.008e-01 0.990 0.32198
## Fee -2.088e-04 6.967e-04 -0.300 0.76442
## VideoAmt -3.879e-01 1.744e-01 -2.225 0.02610 *
## PhotoAmt 8.942e-02 3.378e-02 2.647 0.00811 **
## AgeYears 2.580e-02 4.579e-02 0.563 0.57315
## NumColors2 2.746e-01 1.849e-01 1.485 0.13754
## NumColors3 -4.105e-01 2.022e-01 -2.030 0.04237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1651.5 on 8131 degrees of freedom
## Residual deviance: 1579.2 on 8109 degrees of freedom
## AIC: 1625.2
##
## Number of Fisher Scoring iterations: 15
plot(model2)Checking the resul of the model, we can see the difference between teh different variables in the data set, as the AIC 1625.2
I will select the variables using the step method
The ‘Step’ function in R performs a Step model selection with an objective to minimize the AIC value.
Using Step in both direction
Create multiple models using the STEP function from R, here we check model1 and model2 and run a step in both direcction on these models.
Let’s check an ANOVA table based on the above testing results.
step_b$anova## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 8131 1651.461 1653.461
## 2 + MaturitySize -3 28.476583 8128 1622.984 1630.984
## 3 + PhotoAmt -1 8.622869 8127 1614.361 1624.361
## 4 + NumColors -2 9.334944 8125 1605.026 1619.026
## 5 + FurLength -2 8.468409 8123 1596.558 1614.558
## 6 + VideoAmt -1 3.823620 8122 1592.734 1612.734
## 7 + Quantity -1 3.391945 8121 1589.342 1611.342
From the above results, it shows that the best model is as follows:
summary(step_b)##
## Call:
## glm(formula = AdoptionSpeed ~ MaturitySize + PhotoAmt + NumColors +
## FurLength + VideoAmt + Quantity, family = binomial(link = "logit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2917 0.1559 0.1864 0.2241 0.5909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.92281 0.22695 12.878 < 2e-16 ***
## MaturitySize2 0.87295 0.18160 4.807 1.53e-06 ***
## MaturitySize3 0.40206 0.26610 1.511 0.13081
## MaturitySize4 12.41566 306.10603 0.041 0.96765
## PhotoAmt 0.09674 0.03354 2.884 0.00392 **
## NumColors2 0.27588 0.18456 1.495 0.13496
## NumColors3 -0.41276 0.20146 -2.049 0.04048 *
## FurLength2 -0.34512 0.16512 -2.090 0.03660 *
## FurLength3 -0.65913 0.27623 -2.386 0.01703 *
## VideoAmt -0.39208 0.17259 -2.272 0.02311 *
## Quantity 0.12471 0.07584 1.644 0.10008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1651.5 on 8131 degrees of freedom
## Residual deviance: 1589.3 on 8121 degrees of freedom
## AIC: 1611.3
##
## Number of Fisher Scoring iterations: 14
We see how all the predictors are statistical significant, also, noticing how the variable are as statistical significant and the compared values.
The below plot shows our fitted models vs Density.
hist(step_b$fitted.values, main = " Histogram ",xlab = "Fitted models", col = 'skyblue3')Show the predicted values
train_data$Predict <- ifelse(step_b$fitted.values >0.5,"pos","neg")
head(train_data$Predict)## [1] "pos" "pos" "pos" "pos" "pos" "pos"
Here we can see the model works really well on the train data, we need to use it on the evaluation data and check the accuracy of the model on it.
Pulling the evaluation data from Github
evaluation <- read.csv("https://raw.githubusercontent.com/akarimhammoud/Data_621/main/Final%20Project/data/TestData/test_cleaned.csv")Transform the variables in the evaluation data.
evaluation <- transform(
evaluation,
Type=as.factor(Type),
Breed1=as.factor(Breed1),
Gender=as.factor(Gender),
Color1=as.factor(Color1),
Color2=as.factor(Color2),
Color3=as.factor(Color3),
MaturitySize=as.factor(MaturitySize),
FurLength=as.factor(FurLength),
Vaccinated = as.factor(Vaccinated),
Dewormed = as.factor(Dewormed),
Sterilized = as.factor(Sterilized),
Health = as.factor(Health),
NumColors = as.factor(NumColors)
)Here lets transform the evaluation data to fit the model where we used the train data.
#transformed variables
evaluation <- evaluation %>%
dplyr::select(Gender,MaturitySize ,FurLength, Vaccinated, Dewormed, Sterilized , Health , Quantity , Fee , VideoAmt , PhotoAmt , AgeYears , NumColors)#predictions of the model witht the cleaned evaluation data
#predictions
pred <- predict(model2, evaluation, type = "response")
summary(pred)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.6220 0.9710 0.9812 0.9770 0.9876 1.0000 1
here we can see the median 0.9812 and the mean 0.9770 based on the evaluation data, this is a really good prediction close the 100, and with running the test we can see the real value of this model.
# The predictions variable
pred_df <- as.data.frame(pred)
#Round the values to 4
pred_df$values <- round((exp(pred_df$pred)),4)
head(pred_df$values)## [1] 2.6551 2.6850 2.6772 2.6545 2.6539 2.6519
The values here shows the density of the adoption rate is more around 2.65
Checking the predictive values for the evaluation data.
# Creating Bar chart for the predictive values.
ggplot(pred_df, aes(x=values)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="steelblue")+
labs(title="Predicted Value",x="Adoption", y = "Density")We can see that the adoption values is more skewed to the right, the density of the adoption rate is more around 2.5 to 2.75