Approximately 3.1 million dogs are sheltered annually in the United States. At these shelters, about 1.2 million dogs are put down annually in the United States. Even the dogs which are not put down are often scared and confused, which makes them less adoptable due to their behavior during visits. However, there may be a way to determine the adoptability of canines, which shelters could potentially use to optimize their adoption process. This would result in more shelter dogs being homed and not being re-sheltered. The first step is to understand what makes specific dogs “adoptable” or not.
The data used for investigating the adaptability of shelter dogs is that of The Pudding’s Amber Thomas and Sacha Maxim. The method that the team plans to use is by cleaning, exploring, and analyzing said data such that correlations and patterns can be illustrated which may explain the trends by which Americans adopt dogs.
Our current proposed approach/analytic technique which we believe will help to address this problem is Frequent Subsequence Mining or Subset Mining. By filtering for certain traits of dogs and observing if these traits possibly contribute to their eventual adoption. For example, do Americans prefer to adopt dogs which can be bred (like those who have not been spayed or neutered), or would a dog be more likely to be adopted if there was no chance for unwanted pregnancy? Does location affect this? Perhaps there are more dog breeders in southern states as opposed to northern states. Which breeds of dogs are more likely to be adopted, and are adopters willing to import or export their pets to get the specific breed that they want? By wrangling this data and exploring these factors, perhaps a clearer understanding of a dog’s adaptability can be gleaned.
This analysis will help shelters enhance their adoption process to target certain American demographics with different desires for the dogs they choose to adopt. This way, adoption can be facilitated because the shelters can either make their dogs more adoptable by declawing, grooming, training, sterilizing, importing, exporting, etc. Improved adoptability of dogs will then theoretically lead to increased rates of adoption, which will decrease overall suffering of sheltered dogs.
The package which was used in this analysis and is therefore required to replicate it is tidyverse. We chose to use this package because it is comprised of several other packages, including: ggplot2 (for creating graphics), dplyr (for data manipulation), tidyr (to help clean the data), readr (to read rectangular data), tibble (to modify data frames), etc. Our team used the tidyverse package because we wanted to utilize the following commands which are available in the tidyverse package: readr::read_csv, tibble(), mutate(), etc.
We also used the hwardcloud and wordcloud2 packages to generate word clouds. Additionally, package tm was used for text mining, package SnowballC was used for text stemming, and package RColorBrewer was used for access to different color palettes. We use the devtools github package, including the libraries d3scatter and crosstalk to generate interactive plots.
# Load Libraries
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(tibble)
library(usmap)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(crosstalk)
library(d3scatter)
Data for this project comes from Adoptable Dogs on the R for Data Science Github repository. [https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-17]
This data was originally used in a project by Amber Thomas and designed by Sacha Maxim. This data was included with their article, (Finding Forever Homes)[https://pudding.cool/2019/10/shelters/] and uses this adoptable dogs data from 09-20-2019 from Petfinder.com.
Included are three data sets. Two of which can be joined and at third that is a summary data set.
dog_moves <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv')
dog_travel <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_travel.csv')
dog_descriptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_descriptions.csv')
summary(dog_moves)
## location exported imported total
## Length:90 Min. : 1.00 Min. : 1.00 Min. : 15.0
## Class :character 1st Qu.: 1.00 1st Qu.: 11.25 1st Qu.: 477.5
## Mode :character Median : 4.00 Median : 24.50 Median : 857.0
## Mean : 30.37 Mean : 64.74 Mean :1139.8
## 3rd Qu.: 18.00 3rd Qu.: 73.25 3rd Qu.:1650.0
## Max. :635.00 Max. :390.00 Max. :4002.0
## NA's :9 NA's :52 NA's :39
## inUS
## Mode :logical
## FALSE:39
## TRUE :51
##
##
##
##
x <- dog_moves[is.na(dog_moves$total),]
print(tibble(x), n=10)
## # A tibble: 39 x 5
## location exported imported total inUS
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Puerto Rico 131 NA NA FALSE
## 2 South Korea 76 NA NA FALSE
## 3 Mexico 54 NA NA FALSE
## 4 China 28 NA NA FALSE
## 5 Thailand 20 NA NA FALSE
## 6 India 18 NA NA FALSE
## 7 Taiwan 18 NA NA FALSE
## 8 Egypt 15 NA NA FALSE
## 9 Bahamas 13 NA NA FALSE
## 10 Spain 8 NA NA FALSE
## # ... with 29 more rows
As per description The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location and vice versa for imported dogs. Total is simply total dogs available. For imported and exported we can replace missing values with 0, although as we don’t want to lose entire rows and possibly it might actually be 0 as well. Apart from Indianapolis rest of the values are foreign countries for total so it does not matter much as we are not focused on total dogs available for adoption outside the US. It will simply be transformed into 0. Logically and for our analysis, this would not have an impact.
dog_moves$total[is.na(dog_moves$total)] <- 0
dog_moves$imported[is.na(dog_moves$imported)] <- 0
dog_moves$exported[is.na(dog_moves$exported)] <- 0
We have missing values for remove and still there. We might not have this data collected but it does have to exist Animal removed from location is for remove. Still there is Whether the animal is still located in their origin location and will be transported to their final destination after adoption. Ideally interaction with source of data would clarify this. For remove we can assume Na is false which will imply animal is still present at location. Similarly for still there.These movements are for dogs in transit for adoption and anything else we can assume the large chuck of dogs are not moving and hence NA turned into FALSE.
dog_travel$remove[is.na(dog_travel$remove)] <- FALSE
dog_travel$still_there[is.na(dog_travel$still_there)] <- FALSE
summary(dog_travel)
## id contact_city contact_state description
## Min. : 8619716 Length:6194 Length:6194 Length:6194
## 1st Qu.:44940096 Class :character Class :character Class :character
## Median :45734990 Mode :character Mode :character Mode :character
## Mean :44786280
## 3rd Qu.:45923182
## Max. :46043149
## found manual remove still_there
## Length:6194 Length:6194 Mode :logical Mode :logical
## Class :character Class :character FALSE:4456 FALSE:5875
## Mode :character Mode :character TRUE :1738 TRUE :319
##
##
##
Declawed, color tertiary, photo, manual and tags are removed as nearly all values here are missing. Species is dog for every row so it adds no additional value.
dog_descriptions <- subset(dog_descriptions,select = -c(declawed,species,color_tertiary ,photo ,tags))
dog_travel <- subset(dog_travel,select = -c(manual))
Remaining missing values are all in character variables, the NA values can be replaced by UNKNOWN character string for models which don’t tolerate missing values.
colSums(is.na(dog_descriptions))
## id org_id url breed_primary breed_secondary
## 0 0 0 0 37359
## breed_mixed breed_unknown color_primary color_secondary age
## 0 0 32046 46121 0
## sex size coat fixed house_trained
## 0 0 30995 0 0
## special_needs shots_current env_children env_dogs env_cats
## 0 0 30153 23511 38828
## name status posted contact_city contact_state
## 0 0 33 0 0
## contact_zip contact_country stateQ accessed type
## 12 0 0 33 640
## description
## 8705
colSums(is.na(dog_travel))
## id contact_city contact_state description found
## 0 0 0 0 0
## remove still_there
## 0 0
colSums(is.na(dog_moves))
## location exported imported total inUS
## 0 0 0 0 0
In this section we evaluate the individual variables. Looking at the datatype and values in the attribute.
Evaluation for: * Is the datatype appropriate? * Are there data that will disrupt the analysis, including outliers?
The Process: 1. Run summary on the attribute. 2. Evaluate that data. Running additional reports and plots to inspect the data including but not limited to tidy queries, visually inspecting output and plots. 3. Run any cleanup or tidying that was appropriate. 4. Provide visual representation, table or other output to communicate the contents.
Included below all the attributes and the summary views that were chosen to represent the data. Where cleanup or other manipulation was needed, this code is included and explained in more detail.
location - character - The full name of the US state or country
head(dog_moves$location)
## [1] "Texas" "Alabama" "North Carolina" "South Carolina"
## [5] "Georgia" "Puerto Rico"
exported - double - The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location
summary(dog_moves$exported)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 3.00 27.33 16.50 635.00
imported - double - The number of adoptable dogs available in this state that originated in a different location
summary(dog_moves$imported)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 27.33 15.50 390.00
total - double - The total number of adoptable dogs available in a given state.
summary(dog_moves$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 66.5 645.9 939.5 4002.0
inUS - logical - Whether or not a location is in the US or not. Here, US territories will return FALSE
summary(dog_moves$inUS)
## Mode FALSE TRUE
## logical 39 51
dim(dog_travel)
## [1] 6194 7
id - double - The unique PetFinder identification number for each animal
quantile(dog_travel$id)
## 0% 25% 50% 75% 100%
## 8619716 44940096 45734990 45923182 46043149
contact_city - character - The rescue/shelter’s listed city
head(dog_travel$contact_city)
## [1] "Anoka" "Groveland" "Adamstown" "Saint Cloud" "Pueblo"
## [6] "Manchester"
contact_state - character - The rescue/shelter’s listed State
head(dog_travel$contact_state)
## [1] "MN" "FL" "MD" "MN" "CO" "CT"
description - character - The full description of each animal as entered by the rescue/shelter
head(dog_travel$description, 1)
## [1] "Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He loves rope toys and running around his foster mom's yard. \n\nHe is 4 years old, just under 10 pounds, and is a special needs dog. He needs vitamin B-12 shots every other week (may change), and needs prescription dog food with no meet or protein. Boris is on a vegan diet.\n\nIf interested in Boris please fill out or adoption application so we can set up a meet and greet for you! His foster mom can explain more in-depth about his needs and can answer any questions you may have about his lifestyle.\n\nBoris' adoption fee is $300 and the application can be found at http://www.aohrescue.org/dog-adoption-application"
found - character - Where the animal was found. Note: this is a mixed bag of values States/countries/cities
head(dog_travel$found, 5)
## [1] "Arkansas" "Abacos" "Adam" "Adaptil" "Afghanistan"
manual - character - not defined in the data dictionary REMOVED in a prior step
remove - logical - Animal removed from location
summary(dog_travel$remove)
## Mode FALSE TRUE
## logical 4456 1738
still_there - logical - TRUE/FALSE - Whether the animal is still located in their origin location and will be transported to their final destination after adoption.
summary(dog_travel$still_there)
## Mode FALSE TRUE
## logical 5875 319
####Dataset dimensions
dim(dog_descriptions)
## [1] 58180 31
id - double - The unique PetFinder identification number for each animal.
quantile(dog_descriptions$id)
## 0% 25% 50% 75% 100%
## 604115 44516882 45596624 45916674 46043149
org_id - character - The unique identification number for each shelter or rescue.
head(dog_descriptions$org_id, 5)
## [1] "NV163" "NV163" "NV99" "NV202" "NV184"
url - character - The URL for each animal’s listing.
head(dog_descriptions$url, 5)
## [1] "https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [2] "https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [3] "https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [4] "https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [5] "https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
species - character - Species of animal. REMOVED in prior step (all rows are Dogs)
breed_primary - character - The primary (assumed) breed assigned by the shelter or rescue.
head(table(dog_descriptions$breed_primary))
##
## Affenpinscher Afghan Hound Airedale Terrier Akbash
## 17 4 19 3
## Akita Alaskan Malamute
## 181 72
breed_secondary - character - The secondary (assumed) breed assigned by the shelter or rescue.
head(table(dog_descriptions$breed_secondary))
##
## Affenpinscher Afghan Hound Airedale Terrier Akbash
## 18 1 6 10
## Akita Alaskan Malamute
## 28 13
breed_mixed - logical - Whether or not an animal is presumed to be mixed breed.
summary(dog_descriptions$breed_mixed)
## Mode FALSE TRUE
## logical 16589 41591
breed_unknown - logical - Whether or not the animal’s breed is completely unknown. REMOVED - These were all FALSE so attribute is removed
table(dog_descriptions$breed_unknown)
##
## FALSE
## 58180
dog_descriptions <-
dog_descriptions %>%
mutate(breed_unknown = NULL)
color_primary - character - The most prevalent color of an animal.
table(dog_descriptions$color_primary)
##
## Apricot / Beige Bicolor
## 1322 1415
## Black Brindle
## 7639 1953
## Brown / Chocolate Golden
## 2434 817
## Gray / Blue / Silver Harlequin
## 1174 73
## Merle (Blue) Merle (Red)
## 186 142
## Red / Chestnut / Orange Sable
## 1652 159
## Tricolor (Brown, Black, & White) White / Cream
## 2094 3134
## Yellow / Tan / Blond / Fawn
## 1940
color_secondary - character - The second most prevalent color of an animal.
table(dog_descriptions$color_secondary)
##
## Apricot / Beige Bicolor
## 45 55
## Black Brindle
## 497 269
## Brown / Chocolate Golden
## 801 184
## Gray / Blue / Silver Harlequin
## 296 37
## Merle (Blue) Merle (Red)
## 33 29
## Red / Chestnut / Orange Sable
## 339 34
## Tricolor (Brown, Black, & White) White / Cream
## 124 7285
## Yellow / Tan / Blond / Fawn
## 2031
color_tertiary - character - The third most prevalent color of an animal. REMOVED in a prior step
age - character - The assumed age class of an animal (Baby, Young, Adult, or Senior).
table(dog_descriptions$age)
##
## Adult Baby Senior Young
## 27955 9397 4634 16194
sex - character - The sex of an animal (Female, Male, or Unknown).
table(dog_descriptions$sex)
##
## Female Male Unknown
## 27883 30294 3
size - character - The general size class of an animal (Small, Medium, Large, Extra Large).
table(dog_descriptions$size)
##
## Extra Large Large Medium Small
## 931 15761 29908 11580
coat - character - Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).
table(dog_descriptions$coat)
##
## Curly Hairless Long Medium Short Wire
## 135 16 1185 4927 20671 251
#barplot(table(dog_descriptions$coat))
fixed - logical - Whether or not an animal has been spayed/neutered.
table(dog_descriptions$fixed)
##
## FALSE TRUE
## 11559 46621
house_trained - logical - Whether or not an animal is trained to not go to the bathroom in the house.
table(dog_descriptions$house_trained)
##
## FALSE TRUE
## 37624 20556
declawed - logical - Whether or not the animal has had its dewclaws removed. REMOVED in a prior step all were FALSE
special_needs - logical - Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).
table(dog_descriptions$special_needs)
##
## FALSE TRUE
## 56034 2146
shots_current - logical - Whether or not the animal is up to date on all of their vaccines and other shots.
table(dog_descriptions$shots_current)
##
## FALSE TRUE
## 16112 42068
env_children - logical - Whether or not the animal is recommended for a home with children.
table(dog_descriptions$env_children)
##
## FALSE TRUE
## 4439 23588
env_dogs - logical - Whether or not the animal is recommended for a home with other dogs.
table(dog_descriptions$env_dogs)
##
## FALSE TRUE
## 3547 31122
env_cats - logical - Whether or not the animal is recommended for a home with cats.
table(dog_descriptions$env_cats)
##
## FALSE TRUE
## 6810 12542
name - character - The animal’s name (as given by the shelter/rescue). This data was cleaned up a bit. There were Identification number and other extra information that for the analysis that we are looking to perform will be distracting and confusing to the end users. It might be necessary to do additional cleanup as we proceed. We are starting by getting the numbers that would be meaningless to the end user cleaned up.
The process is using regular expressions to update the values with just an id number to “Unknown”. Note: includes some notes Like “sponsored” “courtesy post” Id numbers of some kind
head(dog_descriptions$name, 20)
## [1] "HARLEY" "BIGGIE" "Ziggy" "Gypsy" "Theo"
## [6] "Oliver" "Macadamia" "Dodger" "Huckleberry" "Fagin"
## [11] "Speckles" "Cashew" "Dash" "Sydney" "HENNA"
## [16] "RUBO" "LEGO" "MARIE" "RINGO" "TONY"
dim(dog_descriptions)
## [1] 58180 30
dog_descriptions <- dog_descriptions %>%
mutate(name = str_to_title(name))
dog_descriptions <- dog_descriptions %>%
mutate(name = ifelse( str_detect(name, "^[0-9]*$"), "Unknown", name) ) %>%
mutate(name = ifelse( !str_detect(name, "[A-Z][A-z]*"), "Unknown", name) ) %>%
mutate(name = ifelse( str_detect(name, "^A[0-9]"), "Unknown", name) )
tags - character - Any tags given to the dog by the shelter rescue (pipe | separated). REMOVED in a prior step
photo - character - The URL to the animal’s primary photo. REMOVED in a prior step
status - character - Whether the animal is adoptable or not.
summary(dog_descriptions$status)
## Length Class Mode
## 58180 character character
posted - character - The date that this animal was first listed on PetFinder .
This single data-time attribute has been separated into separate attributes for each of the periods, posted_year, posted_month, …
dog_descriptions <-
dog_descriptions %>%
separate(posted, c('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec'), sep = "([- :])")
dog_descriptions %>%
select('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec') %>%
head()
## # A tibble: 6 x 6
## posted_year posted_month posted_day posted_hour posted_min posted_sec
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2019 09 20 16 37 59
## 2 2019 09 20 16 24 57
## 3 2019 09 20 14 10 11
## 4 2019 09 20 10 08 22
## 5 2019 09 20 06 48 30
## 6 2019 09 20 06 43 59
contact_city - character - The rescue/shelter’s listed city.
Clean up: This field had different issues with cases. This effecting the way that the cities were grouped together.
The process used was a mapping tribble and then joining this to the original data and updating. Another way would have been to change the attribute to Title case throughout. This would have taken less code but would not ensure that all the modifications would have been approprtate as the menthod allows. For example Washington DC would be converted to Washington Dc if title case would have been used.
map <- tribble(
~contact_city, ~adj_city,
"abingdon", "Abingdon",
"lake charles", "Lake Charles",
"JERSEY CITY", "Jersey City",
"aitkin", "Aitkin",
"cary", "Cary",
"carson", "Carson",
"BROOKLYN", "Brooklyn",
"denver", "Denver",
"grandville", "Grandville",
"GREER","Greer",
"BERLIN", "Berlin",
"YERINGTON", "Yerington",
"SACRAMENTO","Sacramento",
"LINDSAY","Lindsay",
"MONUMENT","Monument",
"WHEAT RIDGE","Wheat Ridge",
"NASSAU","Nassau",
"NEWARK","Newark",
"MARION","Marion",
"TROUTMAN","Troutman",
"JIM THORPE","Jim Thorpe",
"POCONO SUMMIT","Pocono Summit",
"MARYSVILLE","Marysville",
"reno", "Reno",
"mount vernon","Mount Vernon",
"modesto","Modesto",
"ridgefield","Ridgefield",
"taunton","Taunton",
"garden city park","Garden City Park",
"churchton","Churchton",
"huntingtown","Huntingtown",
"fayetteville","Fayetteville",
"canton","Canton",
"seattle","Seattle"
)
dog_descriptions <-
dog_descriptions %>%
left_join(map) %>%
mutate(contact_city = if_else(is.na(adj_city), contact_city, adj_city))
## Joining, by = "contact_city"
dog_descriptions %>% distinct(contact_city) %>% head()
## # A tibble: 6 x 1
## contact_city
## <chr>
## 1 Las Vegas
## 2 Mesquite
## 3 Pahrump
## 4 Henderson
## 5 Bullhead City
## 6 Kingman
contact_state - character - The rescue/shelter’s listed state.
Noted that there are numeric values in the contact_state attribute. These apprear to be zipcodes however are not likely to effect the analysis. So we have left them in for now.
table(dog_descriptions$contact_state)
##
## 12220 12477 17325 19053 19063 20136 20905 23112 24588 37189 38506 45061 45249
## 3 2 2 1 1 1 1 1 1 1 1 2 1
## 46158 47131 47454 61944 70601 85249 87108 89146 98106 AK AL AR AZ
## 1 1 1 1 1 1 1 1 7 15 1428 695 2248
## CA CO CT DC DE FL GA HI IA ID IL IN KS
## 1664 1773 1422 336 296 2659 3479 69 485 49 1114 1877 470
## KY LA MA MD ME MI MN MO MS MT NB NC ND
## 1123 912 946 1493 545 673 958 920 510 18 2 2627 64
## NE NH NJ NM NV NY OH OK OR PA QC RI SC
## 120 335 3022 636 857 4002 2670 1636 91 2821 14 607 1618
## SD TN TX UT VA VT WA WI WV WY
## 24 1769 566 485 3058 510 1277 542 565 52
contact_zip - character - The rescue/shelter’s listed zip code.
summary(dog_descriptions$contact_zip)
## Length Class Mode
## 58180 character character
contact_country - character - The rescue/shelter’s listed country. Noted that there are a few countries that would seem to have state codes entered. These seem to be related to the similar issue seen in the state attribute above.
table(dog_descriptions$contact_country)
##
## 89009 AZ CA CT DC DE IL IN KY LA NM TN US
## 1 1 16 5 4 2 1 3 3 1 1 2 58131
## VA WA
## 2 7
stateQ - character - The state abbreviation queried in the API to return this result .
table(dog_descriptions$stateQ)
##
## 2019-09-20 89009 89014 89024 89027 89121 89406
## 33 746 1 284 2 25 181
## 89408 89423 89431 89451 89801 AK AL
## 8 435 37 134 5 15 1043
## AR AZ CA CO CT DC DE
## 695 2169 1028 1773 6725 4665 2365
## FL GA HI IA ID IL IN
## 2659 3439 69 485 49 1118 1917
## KS KY LA MA MD ME MI
## 470 1670 912 909 36 134 673
## MN MO MS MT NC ND NE
## 957 920 510 18 3085 64 120
## NH NJ NM NY OH OK OR
## 382 1863 636 669 1901 1636 91
## PA RI SC SD TN TX UT
## 337 1 1137 24 2323 566 199
## VA VT WA WI WV WY
## 1165 118 1277 543 677 52
accessed - double - The date that this data was acquired from the PetFinder API.
This date was separated into is date components.
dog_descriptions <-
dog_descriptions %>% separate(accessed, c('accessed_year', 'accessed_month', 'accessed_day' ), sep = '-')
dog_descriptions %>%
select(accessed_year, accessed_month, accessed_day) %>%
head()
## # A tibble: 6 x 3
## accessed_year accessed_month accessed_day
## <chr> <chr> <chr>
## 1 2019 09 20
## 2 2019 09 20
## 3 2019 09 20
## 4 2019 09 20
## 5 2019 09 20
## 6 2019 09 20
type - character - The type of animal.
table(dog_descriptions$type)
##
## Dog
## 57540
description - character - The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.
head(dog_descriptions$description,1)
## [1] "Harley is not sure how he wound up at shelter in his senior years but as you see from the pictures the shelter asked if we could find a real home for this active senior boy. You would never know he is 9 years old. Very playful and loves humans and all the dogs he has met at adoptions he seems to like. He is 59 lbs so still pretty strong for a senior but loves walks and you have to love his ears. If you would like to meet this happy go lucky boy please contact AdoptAnimalNetwork@gmail.com. Updated pictures this Sunday."
We will combine dataset dog description and dog travelas a new data set and summarize the information to analyze or compare summary information from dataset dog move.
We will start EDA by exploring and visualizing the variables within each data sets.
We build interactive plots for the country a dog is in to provide information to the user.
Dog Location
library(crosstalk)
library(d3scatter)
shared_dogmoves <- SharedData$new(dog_moves)
bscols(widths = c(3,NA,NA),
list(
filter_checkbox("ctr", "In the US?", shared_dogmoves, ~inUS, inline = TRUE),
filter_select("st", "Location Name", shared_dogmoves, ~location)
),
d3scatter(shared_dogmoves, ~total, ~exported, ~factor(inUS), width="100%", height=250)
)
## Warning in bscols(widths = c(3, NA, NA), list(filter_checkbox("ctr", "In the
## US?", : Too many widths provided to bscols; truncating
Above we can see a plot comparing the total number of dogs plotted against the total number of exported dogs in our dataset with an interactive filter to view only dogs in the US vs. those outside of the US, as well as a dropdown location selector. We could incorporate a more advanced version of this inside of our shiny app to give insight as to the total number of dogs in a given location, as well as whether they are generally exported or imported.
In US
barplot(table(dog_moves$inUS), ylab = "Count of Dogs", main = "# of Dogs in vs. not in the US")
Exported Dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "location", ylab = "# of Exported Dogs", main = "# of Exported Dogs by Location", las = 2)
Imported Dogs
barplot(dog_moves$imported ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "# of Imported Dogs by Location", las = 2)
Total Dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "Total # of Dogs by Location", las = 2)
We build word cloud on variabledescription as it’s mainly text and wishing to gain more information from the text mining.
We also intend to gain geographical visualization of travel summarized by contact state contact state
travel<- dog_travel%>%
group_by(contact_state)%>%
count()
travel<- is.numeric(travel$n)
#plot_usmap(data = travel,
# values = travel$n,
# include = c("AK", "AL", "AR", "AZ", "CA", #"CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","IN","KS","KY#","LA","MA","MD","ME","MI","MN","MO","MS","MT","NB","NV","ND","#NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","QC","RI","SC"#,"SD","TN","TX","UT","VA","VT","WA","WI","WV","WY"),
# color = "red") +
# scale_fill_continuous(
# low = "white", high = "red", name = "travel contact state", #label = scales::comma) +
# labs(title = "US States", subtitle = "Travel group by #information")
#plot_usmap(data = dog_travel, values = table(contact_state), color = "red") +
# scale_fill_continuous(name = "Contact State",
# label = scales::comma) +
# theme(legend.position = "right")
Have problem generating the map. Later we could also do similar with found.
description This variable contents only text, so generated a word cloud to see the most frequent word.
# Load the data as a corpus
docs <- Corpus(VectorSource(dog_travel$description))
#inspect the content of docs
inspect(docs)
#text transformation
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, toSpace, "~")
docs <- tm_map(docs, toSpace, "~~")
#text cleaning
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
# Remove numbers
docs <- tm_map(docs, removeNumbers)
## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, c("blabla1", "blabla2")):
## transformation drops documents
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation drops
## documents
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents
# Text stemming
# docs <- tm_map(docs, stemDocument)
#build text- document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 1)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
found
barplot(table(dog_travel$found))
Maybe we can do parallel matching with
contact city/ state later, or map by state and do comparision. At present, we can not achieve this as the variable is described asAlabamainstead of AL. The work need to be done is: word matching, if it’s state in U.S., we will match it to contact state, which is two letter form; if it’s city data, we will match it to contact city; if it’s not in the U.S., we will list the country and combine with dog_moves dataset later.
remove
barplot(table(dog_travel$remove))
still_there
barplot(table(dog_travel$still_there))
We intend to used bar chart/ pie chart on categorical variables eg, breed, color, sex. We also intend to use parallel coordinates on variables foundand contact state/city.
On top of that, we intend to gain new information by slicing and grouping by the state, we can get more summarized information by state and then look into information within each state.
par(mfrow=c(2,2))
barplot(table(dog_descriptions$breed_primary))
barplot(table(dog_descriptions$sex))
barplot(table(dog_descriptions$age))
barplot(table(dog_descriptions$color_primary))
** color primary**
# Increase margin size
par(mar=c(13,4,4,4))
barplot(table(dog_descriptions$color_secondary),las=2)
color secondary
# Increase margin size
par(mar=c(13,4,4,4))
barplot(table(dog_descriptions$color_secondary),las=2)
coat
barplot(table(dog_descriptions$coat))
breed_df<- dog_descriptions%>%
group_by(contact_state,breed_primary)%>%
count()%>%
arrange(contact_state, desc(n))
#table(breed_df$contact_state)
we also intend to visualize different categories by state.
#plot_usmap(
# data = breed_df, values = n, include = c("AK", "AL", "AR", #"AZ", "CA", "CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","#IN","KS","KY","LA","MA","MD","ME","MI","MN","MO","MS","MT","NB"#,"NV","ND","NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","Q#C","RI","SC","SD","TN","TX","UT","VA","VT","WA","WI","WV","WY")#, color = "red"
# ) +
# scale_fill_continuous(
# low = "white", high = "red", name = "Breed in", label = #scales::comma
# ) +
# labs(title = "Western US States", subtitle = "These are the #states in the Pacific Timezone.")
We still need to fix the code on this one.
From the exploratory data analysis by visualization above. We want to create more explicit and interactive visualization, which could not only help our analysis but more user-friendly.
We want to build an interactive visualization using shiny. We will be working on all three data sets. We will be able to let user to get more generalized information by state. Additionally, when a certain state is selected, more detailed information about dogs in the state could display. For example, the imported/ exported dogs and their names, breeds, origins, age, etc.
{r shiny, echo = FALSE} selectInput(“data”, "“, c(”co2“,”lh"))
See a plot:
``{r echo = FALSE}
renderPlot({
d <- get(input$data)
plot(d) })
We are trying to do logistic regression analysis on the combined data of dog description and dog moves. We will split the dataset into train and test data set.
What we need right now that are unachievable is effective text mining, as we find some information like adoption fee and tranfer fee are contained within the description variable of dog description dataset. They could play a role in the model as well.
Looking into the word cloud we generated, the most frequent words that have been mentioned are not very useful for our adoption decision evaluation. We could further explore the presence of certain types of words in certain types of dogs to recommend similar records to a user interacting with our Shiny app.
We decide to try with Frequent itemset mining (FPA).
| Variable | Class | description |
|---|---|---|
location |
character | The full name of the US state or country |
exported |
double | The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location |
imported |
double | The number of adoptable dogs available in this state that originated in a different location |
total |
double | The total number of adoptable dogs availabe in a given state. |
inUS |
logical | Whether or not a location is in the US or not. Here, US territories will return FALSE |
| Variable | Class | description |
|---|---|---|
id |
double | The unique PetFinder identification number for each animal |
contact_city |
character | The rescue/shelter’s listed city |
contact_state |
character | The rescue/shelter’s listed State |
description |
character | The full description of each animal as entered by the rescue/shelter |
found |
character | Where the animal was found. |
manual |
character | . |
remove |
logical | Animal removed from location |
still_there |
logical TRUE/FALSE | - Whether the animal is still located in their origin location and will be transported to their final destination after adoption. |
| Variable | Class | description |
|---|---|---|
id |
double | The unique PetFinder identification number for each animal. |
org_id |
character | The unique identification number for each shelter or rescue. |
url |
character | The URL for each animal’s listing. |
species |
character | Species of animal. |
breed_primary |
character | The primary (assumed) breed assigned by the shelter or rescue. |
breed_secondary |
character | The secondary (assumed) breed assigned by the shelter or rescue. |
breed_mixed |
logical | Whether or not an animal is presumed to be mixed breed. |
breed_unknown |
logical | Whether or not the animal’s breed is completely unknown. |
color_primary |
character | The most prevalent color of an animal. |
color_secondary |
character | The second most prevalent color of an animal. |
color_tertiary |
character | The third most prevalent color of an animal. |
age |
character | The assumed age class of an animal (Baby, Young, Adult, or Senior). |
sex |
character | The sex of an animal (Female, Male, or Unknown). |
size |
character | The general size class of an animal (Small, Medium, Large, Extra Large). |
coat |
character | Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire). |
fixed |
logical | Whether or not an animal has been spayed/neutered. |
house_trained |
logical | Whether or not an animal is trained to not go to the bathroom in the house. |
declawed |
logical | Whether or not the animal has had its dewclaws removed. |
special_needs |
logical | Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care). |
shots_current |
logical | Whether or not the animal is up to date on all of their vaccines and other shots. |
env_children |
logical | Whether or not the animal is recommended for a home with children. |
env_dogs |
logical | Whether or not the animal is recommended for a home with other dogs. |
env_cats |
logical | Whether or not the animal is recommended for a home with cats. |
name |
character | The animal’s name (as given by the shelter/rescue). |
tags |
character | Any tags given to the dog by the shelter rescue (pipe |
photo |
character | The URL to the animal’s primary photo. |
status |
character | Whether the animal is adoptable or not. |
posted |
character | The date that this animal was first listed on PetFinder . |
contact_city |
character | The rescue/shelter’s listed city. |
contact_state |
character | The rescue/shelter’s listed state. |
contact_zip |
character | The rescue/shelter’s listed zip code. |
contact_country |
character | The rescue/shelter’s listed country. |
stateQ |
character | The state abbreviation queried in the API to return this result . |
accessed |
double | The date that this data was acquired from the PetFinder API. |
type |
character | The type of animal. |
description |
character | The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API. |