1. Introduction

1.1 Problem Statement

Approximately 3.1 million dogs are sheltered annually in the United States. At these shelters, about 1.2 million dogs are put down annually in the United States. Even the dogs which are not put down are often scared and confused, which makes them less adoptable due to their behavior during visits. However, there may be a way to determine the adoptability of canines, which shelters could potentially use to optimize their adoption process. This would result in more shelter dogs being homed and not being re-sheltered. The first step is to understand what makes specific dogs “adoptable” or not.

The data used for investigating the adaptability of shelter dogs is that of The Pudding’s Amber Thomas and Sacha Maxim. The method that the team plans to use is by cleaning, exploring, and analyzing said data such that correlations and patterns can be illustrated which may explain the trends by which Americans adopt dogs.

Our current proposed approach/analytic technique which we believe will help to address this problem is Frequent Subsequence Mining or Subset Mining. By filtering for certain traits of dogs and observing if these traits possibly contribute to their eventual adoption. For example, do Americans prefer to adopt dogs which can be bred (like those who have not been spayed or neutered), or would a dog be more likely to be adopted if there was no chance for unwanted pregnancy? Does location affect this? Perhaps there are more dog breeders in southern states as opposed to northern states. Which breeds of dogs are more likely to be adopted, and are adopters willing to import or export their pets to get the specific breed that they want? By wrangling this data and exploring these factors, perhaps a clearer understanding of a dog’s adaptability can be gleaned.

This analysis will help shelters enhance their adoption process to target certain American demographics with different desires for the dogs they choose to adopt. This way, adoption can be facilitated because the shelters can either make their dogs more adoptable by declawing, grooming, training, sterilizing, importing, exporting, etc. Improved adoptability of dogs will then theoretically lead to increased rates of adoption, which will decrease overall suffering of sheltered dogs.

1.2 Packages Required

The package which was used in this analysis and is therefore required to replicate it is tidyverse. We chose to use this package because it is comprised of several other packages, including: ggplot2 (for creating graphics), dplyr (for data manipulation), tidyr (to help clean the data), readr (to read rectangular data), tibble (to modify data frames), etc. Our team used the tidyverse package because we wanted to utilize the following commands which are available in the tidyverse package: readr::read_csv, tibble(), mutate(), etc.

We also used the hwardcloud and wordcloud2 packages to generate word clouds. Additionally, package tm was used for text mining, package SnowballC was used for text stemming, and package RColorBrewer was used for access to different color palettes. We use the devtools github package, including the libraries d3scatter and crosstalk to generate interactive plots.

1.3 prepare and load package

# Load Libraries
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(tibble)

library(usmap)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

library(crosstalk)
library(d3scatter)

2. Data Cleaning

2.1 Data Source

Data for this project comes from Adoptable Dogs on the R for Data Science Github repository. [https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-17]

This data was originally used in a project by Amber Thomas and designed by Sacha Maxim. This data was included with their article, (Finding Forever Homes)[https://pudding.cool/2019/10/shelters/] and uses this adoptable dogs data from 09-20-2019 from Petfinder.com.

Included are three data sets. Two of which can be joined and at third that is a summary data set.

Load Data and library

dog_moves <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv')
dog_travel <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_travel.csv')
dog_descriptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_descriptions.csv')

Remove Missing Values

summary(dog_moves)
##    location            exported         imported          total       
##  Length:90          Min.   :  1.00   Min.   :  1.00   Min.   :  15.0  
##  Class :character   1st Qu.:  1.00   1st Qu.: 11.25   1st Qu.: 477.5  
##  Mode  :character   Median :  4.00   Median : 24.50   Median : 857.0  
##                     Mean   : 30.37   Mean   : 64.74   Mean   :1139.8  
##                     3rd Qu.: 18.00   3rd Qu.: 73.25   3rd Qu.:1650.0  
##                     Max.   :635.00   Max.   :390.00   Max.   :4002.0  
##                     NA's   :9        NA's   :52       NA's   :39      
##     inUS        
##  Mode :logical  
##  FALSE:39       
##  TRUE :51       
##                 
##                 
##                 
## 
x <- dog_moves[is.na(dog_moves$total),]
print(tibble(x), n=10)
## # A tibble: 39 x 5
##    location    exported imported total inUS 
##    <chr>          <dbl>    <dbl> <dbl> <lgl>
##  1 Puerto Rico      131       NA    NA FALSE
##  2 South Korea       76       NA    NA FALSE
##  3 Mexico            54       NA    NA FALSE
##  4 China             28       NA    NA FALSE
##  5 Thailand          20       NA    NA FALSE
##  6 India             18       NA    NA FALSE
##  7 Taiwan            18       NA    NA FALSE
##  8 Egypt             15       NA    NA FALSE
##  9 Bahamas           13       NA    NA FALSE
## 10 Spain              8       NA    NA FALSE
## # ... with 29 more rows

Treatment of missing values.

As per description The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location and vice versa for imported dogs. Total is simply total dogs available. For imported and exported we can replace missing values with 0, although as we don’t want to lose entire rows and possibly it might actually be 0 as well. Apart from Indianapolis rest of the values are foreign countries for total so it does not matter much as we are not focused on total dogs available for adoption outside the US. It will simply be transformed into 0. Logically and for our analysis, this would not have an impact.

dog_moves$total[is.na(dog_moves$total)] <- 0
dog_moves$imported[is.na(dog_moves$imported)] <- 0
dog_moves$exported[is.na(dog_moves$exported)] <- 0

We have missing values for remove and still there. We might not have this data collected but it does have to exist Animal removed from location is for remove. Still there is Whether the animal is still located in their origin location and will be transported to their final destination after adoption. Ideally interaction with source of data would clarify this. For remove we can assume Na is false which will imply animal is still present at location. Similarly for still there.These movements are for dogs in transit for adoption and anything else we can assume the large chuck of dogs are not moving and hence NA turned into FALSE.

dog_travel$remove[is.na(dog_travel$remove)] <- FALSE
dog_travel$still_there[is.na(dog_travel$still_there)] <- FALSE
summary(dog_travel)
##        id           contact_city       contact_state      description       
##  Min.   : 8619716   Length:6194        Length:6194        Length:6194       
##  1st Qu.:44940096   Class :character   Class :character   Class :character  
##  Median :45734990   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :44786280                                                           
##  3rd Qu.:45923182                                                           
##  Max.   :46043149                                                           
##     found              manual            remove        still_there    
##  Length:6194        Length:6194        Mode :logical   Mode :logical  
##  Class :character   Class :character   FALSE:4456      FALSE:5875     
##  Mode  :character   Mode  :character   TRUE :1738      TRUE :319      
##                                                                       
##                                                                       
## 

Declawed, color tertiary, photo, manual and tags are removed as nearly all values here are missing. Species is dog for every row so it adds no additional value.

dog_descriptions <- subset(dog_descriptions,select = -c(declawed,species,color_tertiary ,photo ,tags))
dog_travel <- subset(dog_travel,select = -c(manual))

Checking for counts of NA’s by attribute

Remaining missing values are all in character variables, the NA values can be replaced by UNKNOWN character string for models which don’t tolerate missing values.

NA counts in dog_description dataset

colSums(is.na(dog_descriptions))
##              id          org_id             url   breed_primary breed_secondary 
##               0               0               0               0           37359 
##     breed_mixed   breed_unknown   color_primary color_secondary             age 
##               0               0           32046           46121               0 
##             sex            size            coat           fixed   house_trained 
##               0               0           30995               0               0 
##   special_needs   shots_current    env_children        env_dogs        env_cats 
##               0               0           30153           23511           38828 
##            name          status          posted    contact_city   contact_state 
##               0               0              33               0               0 
##     contact_zip contact_country          stateQ        accessed            type 
##              12               0               0              33             640 
##     description 
##            8705

NA counts in dog_travel dataset

colSums(is.na(dog_travel))
##            id  contact_city contact_state   description         found 
##             0             0             0             0             0 
##        remove   still_there 
##             0             0

NA counts in dog_moves dataset

colSums(is.na(dog_moves))
## location exported imported    total     inUS 
##        0        0        0        0        0

Evaluation of the data types and values

In this section we evaluate the individual variables. Looking at the datatype and values in the attribute.

Evaluation for: * Is the datatype appropriate? * Are there data that will disrupt the analysis, including outliers?

The Process: 1. Run summary on the attribute. 2. Evaluate that data. Running additional reports and plots to inspect the data including but not limited to tidy queries, visually inspecting output and plots. 3. Run any cleanup or tidying that was appropriate. 4. Provide visual representation, table or other output to communicate the contents.

Included below all the attributes and the summary views that were chosen to represent the data. Where cleanup or other manipulation was needed, this code is included and explained in more detail.

2.2 Dog Moves Dataset

location - character - The full name of the US state or country

head(dog_moves$location)
## [1] "Texas"          "Alabama"        "North Carolina" "South Carolina"
## [5] "Georgia"        "Puerto Rico"

exported - double - The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location

summary(dog_moves$exported)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00   27.33   16.50  635.00

imported - double - The number of adoptable dogs available in this state that originated in a different location

summary(dog_moves$imported)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   27.33   15.50  390.00

total - double - The total number of adoptable dogs available in a given state.

summary(dog_moves$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    66.5   645.9   939.5  4002.0

inUS - logical - Whether or not a location is in the US or not. Here, US territories will return FALSE

summary(dog_moves$inUS)
##    Mode   FALSE    TRUE 
## logical      39      51

2.3 Dog Travel Dataset

dim(dog_travel)
## [1] 6194    7

id - double - The unique PetFinder identification number for each animal

quantile(dog_travel$id)
##       0%      25%      50%      75%     100% 
##  8619716 44940096 45734990 45923182 46043149

contact_city - character - The rescue/shelter’s listed city

head(dog_travel$contact_city)
## [1] "Anoka"       "Groveland"   "Adamstown"   "Saint Cloud" "Pueblo"     
## [6] "Manchester"

contact_state - character - The rescue/shelter’s listed State

head(dog_travel$contact_state)
## [1] "MN" "FL" "MD" "MN" "CO" "CT"

description - character - The full description of each animal as entered by the rescue/shelter

head(dog_travel$description, 1)
## [1] "Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He loves rope toys and running around his foster mom's yard. \n\nHe is 4 years old, just under 10 pounds, and is a special needs dog. He needs vitamin B-12 shots every other week (may change), and needs prescription dog food with no meet or protein. Boris is on a vegan diet.\n\nIf interested in Boris please fill out or adoption application so we can set up a meet and greet for you! His foster mom can explain more in-depth about his needs and can answer any questions you may have about his lifestyle.\n\nBoris' adoption fee is $300 and the application can be found at http://www.aohrescue.org/dog-adoption-application"

found - character - Where the animal was found. Note: this is a mixed bag of values States/countries/cities

head(dog_travel$found, 5)
## [1] "Arkansas"    "Abacos"      "Adam"        "Adaptil"     "Afghanistan"

manual - character - not defined in the data dictionary REMOVED in a prior step

remove - logical - Animal removed from location

summary(dog_travel$remove)
##    Mode   FALSE    TRUE 
## logical    4456    1738

still_there - logical - TRUE/FALSE - Whether the animal is still located in their origin location and will be transported to their final destination after adoption.

summary(dog_travel$still_there)
##    Mode   FALSE    TRUE 
## logical    5875     319

2.4 dog_descriptions dataset

####Dataset dimensions

dim(dog_descriptions)
## [1] 58180    31

id - double - The unique PetFinder identification number for each animal.

quantile(dog_descriptions$id)
##       0%      25%      50%      75%     100% 
##   604115 44516882 45596624 45916674 46043149

org_id - character - The unique identification number for each shelter or rescue.

head(dog_descriptions$org_id, 5)
## [1] "NV163" "NV163" "NV99"  "NV202" "NV184"

url - character - The URL for each animal’s listing.

head(dog_descriptions$url, 5)
## [1] "https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"              
## [2] "https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"              
## [3] "https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [4] "https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"     
## [5] "https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"

species - character - Species of animal. REMOVED in prior step (all rows are Dogs)

breed_primary - character - The primary (assumed) breed assigned by the shelter or rescue.

head(table(dog_descriptions$breed_primary))
## 
##    Affenpinscher     Afghan Hound Airedale Terrier           Akbash 
##               17                4               19                3 
##            Akita Alaskan Malamute 
##              181               72

breed_secondary - character - The secondary (assumed) breed assigned by the shelter or rescue.

head(table(dog_descriptions$breed_secondary))
## 
##    Affenpinscher     Afghan Hound Airedale Terrier           Akbash 
##               18                1                6               10 
##            Akita Alaskan Malamute 
##               28               13

breed_mixed - logical - Whether or not an animal is presumed to be mixed breed.

summary(dog_descriptions$breed_mixed)
##    Mode   FALSE    TRUE 
## logical   16589   41591

breed_unknown - logical - Whether or not the animal’s breed is completely unknown. REMOVED - These were all FALSE so attribute is removed

table(dog_descriptions$breed_unknown)
## 
## FALSE 
## 58180
dog_descriptions <- 
  dog_descriptions %>% 
  mutate(breed_unknown = NULL)

color_primary - character - The most prevalent color of an animal.

table(dog_descriptions$color_primary)
## 
##                  Apricot / Beige                          Bicolor 
##                             1322                             1415 
##                            Black                          Brindle 
##                             7639                             1953 
##                Brown / Chocolate                           Golden 
##                             2434                              817 
##             Gray / Blue / Silver                        Harlequin 
##                             1174                               73 
##                     Merle (Blue)                      Merle (Red) 
##                              186                              142 
##          Red / Chestnut / Orange                            Sable 
##                             1652                              159 
## Tricolor (Brown, Black, & White)                    White / Cream 
##                             2094                             3134 
##      Yellow / Tan / Blond / Fawn 
##                             1940

color_secondary - character - The second most prevalent color of an animal.

table(dog_descriptions$color_secondary)
## 
##                  Apricot / Beige                          Bicolor 
##                               45                               55 
##                            Black                          Brindle 
##                              497                              269 
##                Brown / Chocolate                           Golden 
##                              801                              184 
##             Gray / Blue / Silver                        Harlequin 
##                              296                               37 
##                     Merle (Blue)                      Merle (Red) 
##                               33                               29 
##          Red / Chestnut / Orange                            Sable 
##                              339                               34 
## Tricolor (Brown, Black, & White)                    White / Cream 
##                              124                             7285 
##      Yellow / Tan / Blond / Fawn 
##                             2031

color_tertiary - character - The third most prevalent color of an animal. REMOVED in a prior step

age - character - The assumed age class of an animal (Baby, Young, Adult, or Senior).

table(dog_descriptions$age)
## 
##  Adult   Baby Senior  Young 
##  27955   9397   4634  16194

sex - character - The sex of an animal (Female, Male, or Unknown).

table(dog_descriptions$sex)
## 
##  Female    Male Unknown 
##   27883   30294       3

size - character - The general size class of an animal (Small, Medium, Large, Extra Large).

table(dog_descriptions$size)
## 
## Extra Large       Large      Medium       Small 
##         931       15761       29908       11580

coat - character - Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).

table(dog_descriptions$coat)
## 
##    Curly Hairless     Long   Medium    Short     Wire 
##      135       16     1185     4927    20671      251
#barplot(table(dog_descriptions$coat))

fixed - logical - Whether or not an animal has been spayed/neutered.

table(dog_descriptions$fixed)
## 
## FALSE  TRUE 
## 11559 46621

house_trained - logical - Whether or not an animal is trained to not go to the bathroom in the house.

table(dog_descriptions$house_trained)
## 
## FALSE  TRUE 
## 37624 20556

declawed - logical - Whether or not the animal has had its dewclaws removed. REMOVED in a prior step all were FALSE

special_needs - logical - Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).

table(dog_descriptions$special_needs)
## 
## FALSE  TRUE 
## 56034  2146

shots_current - logical - Whether or not the animal is up to date on all of their vaccines and other shots.

table(dog_descriptions$shots_current)
## 
## FALSE  TRUE 
## 16112 42068

env_children - logical - Whether or not the animal is recommended for a home with children.

table(dog_descriptions$env_children)
## 
## FALSE  TRUE 
##  4439 23588

env_dogs - logical - Whether or not the animal is recommended for a home with other dogs.

table(dog_descriptions$env_dogs)
## 
## FALSE  TRUE 
##  3547 31122

env_cats - logical - Whether or not the animal is recommended for a home with cats.

table(dog_descriptions$env_cats)
## 
## FALSE  TRUE 
##  6810 12542

name - character - The animal’s name (as given by the shelter/rescue). This data was cleaned up a bit. There were Identification number and other extra information that for the analysis that we are looking to perform will be distracting and confusing to the end users. It might be necessary to do additional cleanup as we proceed. We are starting by getting the numbers that would be meaningless to the end user cleaned up.

The process is using regular expressions to update the values with just an id number to “Unknown”. Note: includes some notes Like “sponsored” “courtesy post” Id numbers of some kind

head(dog_descriptions$name, 20)
##  [1] "HARLEY"      "BIGGIE"      "Ziggy"       "Gypsy"       "Theo"       
##  [6] "Oliver"      "Macadamia"   "Dodger"      "Huckleberry" "Fagin"      
## [11] "Speckles"    "Cashew"      "Dash"        "Sydney"      "HENNA"      
## [16] "RUBO"        "LEGO"        "MARIE"       "RINGO"       "TONY"
dim(dog_descriptions)
## [1] 58180    30
dog_descriptions <- dog_descriptions %>% 
  mutate(name = str_to_title(name))

dog_descriptions <- dog_descriptions %>% 
  mutate(name = ifelse( str_detect(name, "^[0-9]*$"), "Unknown", name) ) %>% 
  mutate(name = ifelse( !str_detect(name, "[A-Z][A-z]*"), "Unknown", name) ) %>% 
  mutate(name = ifelse( str_detect(name, "^A[0-9]"), "Unknown", name) )

tags - character - Any tags given to the dog by the shelter rescue (pipe | separated). REMOVED in a prior step

photo - character - The URL to the animal’s primary photo. REMOVED in a prior step

status - character - Whether the animal is adoptable or not.

summary(dog_descriptions$status)
##    Length     Class      Mode 
##     58180 character character

posted - character - The date that this animal was first listed on PetFinder .

This single data-time attribute has been separated into separate attributes for each of the periods, posted_year, posted_month, …

dog_descriptions <-
  dog_descriptions %>% 
  separate(posted, c('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec'), sep = "([- :])")


dog_descriptions %>% 
  select('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec') %>% 
  head()
## # A tibble: 6 x 6
##   posted_year posted_month posted_day posted_hour posted_min posted_sec
##   <chr>       <chr>        <chr>      <chr>       <chr>      <chr>     
## 1 2019        09           20         16          37         59        
## 2 2019        09           20         16          24         57        
## 3 2019        09           20         14          10         11        
## 4 2019        09           20         10          08         22        
## 5 2019        09           20         06          48         30        
## 6 2019        09           20         06          43         59

contact_city - character - The rescue/shelter’s listed city.

Clean up: This field had different issues with cases. This effecting the way that the cities were grouped together.

The process used was a mapping tribble and then joining this to the original data and updating. Another way would have been to change the attribute to Title case throughout. This would have taken less code but would not ensure that all the modifications would have been approprtate as the menthod allows. For example Washington DC would be converted to Washington Dc if title case would have been used.

map <- tribble(
  ~contact_city, ~adj_city,
  "abingdon", "Abingdon",
  "lake charles", "Lake Charles",
  "JERSEY CITY", "Jersey City",
  "aitkin", "Aitkin",
  "cary", "Cary",
  "carson", "Carson",
  "BROOKLYN", "Brooklyn",
  "denver", "Denver",
  "grandville", "Grandville",
  "GREER","Greer",
  "BERLIN", "Berlin",
  "YERINGTON", "Yerington",
  "SACRAMENTO","Sacramento",
  "LINDSAY","Lindsay",
  "MONUMENT","Monument",
  "WHEAT RIDGE","Wheat Ridge",
  "NASSAU","Nassau",                
  "NEWARK","Newark",
  "MARION","Marion",
  "TROUTMAN","Troutman",
  "JIM THORPE","Jim Thorpe",
  "POCONO SUMMIT","Pocono Summit",
  "MARYSVILLE","Marysville",
  "reno", "Reno",
  "mount vernon","Mount Vernon",
  "modesto","Modesto",
  "ridgefield","Ridgefield",
  "taunton","Taunton",
  "garden city park","Garden City Park",
  "churchton","Churchton",
  "huntingtown","Huntingtown",
  "fayetteville","Fayetteville",
  "canton","Canton",
  "seattle","Seattle"
)

dog_descriptions <- 
  dog_descriptions %>% 
  left_join(map) %>% 
  mutate(contact_city = if_else(is.na(adj_city), contact_city, adj_city))
## Joining, by = "contact_city"
dog_descriptions %>% distinct(contact_city) %>% head()
## # A tibble: 6 x 1
##   contact_city 
##   <chr>        
## 1 Las Vegas    
## 2 Mesquite     
## 3 Pahrump      
## 4 Henderson    
## 5 Bullhead City
## 6 Kingman

Contact_state

contact_state - character - The rescue/shelter’s listed state.

Noted that there are numeric values in the contact_state attribute. These apprear to be zipcodes however are not likely to effect the analysis. So we have left them in for now.

table(dog_descriptions$contact_state)
## 
## 12220 12477 17325 19053 19063 20136 20905 23112 24588 37189 38506 45061 45249 
##     3     2     2     1     1     1     1     1     1     1     1     2     1 
## 46158 47131 47454 61944 70601 85249 87108 89146 98106    AK    AL    AR    AZ 
##     1     1     1     1     1     1     1     1     7    15  1428   695  2248 
##    CA    CO    CT    DC    DE    FL    GA    HI    IA    ID    IL    IN    KS 
##  1664  1773  1422   336   296  2659  3479    69   485    49  1114  1877   470 
##    KY    LA    MA    MD    ME    MI    MN    MO    MS    MT    NB    NC    ND 
##  1123   912   946  1493   545   673   958   920   510    18     2  2627    64 
##    NE    NH    NJ    NM    NV    NY    OH    OK    OR    PA    QC    RI    SC 
##   120   335  3022   636   857  4002  2670  1636    91  2821    14   607  1618 
##    SD    TN    TX    UT    VA    VT    WA    WI    WV    WY 
##    24  1769   566   485  3058   510  1277   542   565    52

contact_zip - character - The rescue/shelter’s listed zip code.

summary(dog_descriptions$contact_zip)
##    Length     Class      Mode 
##     58180 character character

contact_country - character - The rescue/shelter’s listed country. Noted that there are a few countries that would seem to have state codes entered. These seem to be related to the similar issue seen in the state attribute above.

table(dog_descriptions$contact_country)
## 
## 89009    AZ    CA    CT    DC    DE    IL    IN    KY    LA    NM    TN    US 
##     1     1    16     5     4     2     1     3     3     1     1     2 58131 
##    VA    WA 
##     2     7

stateQ - character - The state abbreviation queried in the API to return this result .

table(dog_descriptions$stateQ)
## 
## 2019-09-20      89009      89014      89024      89027      89121      89406 
##         33        746          1        284          2         25        181 
##      89408      89423      89431      89451      89801         AK         AL 
##          8        435         37        134          5         15       1043 
##         AR         AZ         CA         CO         CT         DC         DE 
##        695       2169       1028       1773       6725       4665       2365 
##         FL         GA         HI         IA         ID         IL         IN 
##       2659       3439         69        485         49       1118       1917 
##         KS         KY         LA         MA         MD         ME         MI 
##        470       1670        912        909         36        134        673 
##         MN         MO         MS         MT         NC         ND         NE 
##        957        920        510         18       3085         64        120 
##         NH         NJ         NM         NY         OH         OK         OR 
##        382       1863        636        669       1901       1636         91 
##         PA         RI         SC         SD         TN         TX         UT 
##        337          1       1137         24       2323        566        199 
##         VA         VT         WA         WI         WV         WY 
##       1165        118       1277        543        677         52

accessed - double - The date that this data was acquired from the PetFinder API.

This date was separated into is date components.

dog_descriptions <-
  dog_descriptions %>% separate(accessed, c('accessed_year', 'accessed_month', 'accessed_day' ), sep = '-') 
  
  
dog_descriptions %>% 
  select(accessed_year, accessed_month, accessed_day) %>% 
  head()
## # A tibble: 6 x 3
##   accessed_year accessed_month accessed_day
##   <chr>         <chr>          <chr>       
## 1 2019          09             20          
## 2 2019          09             20          
## 3 2019          09             20          
## 4 2019          09             20          
## 5 2019          09             20          
## 6 2019          09             20

type - character - The type of animal.

table(dog_descriptions$type)
## 
##   Dog 
## 57540

description - character - The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.

head(dog_descriptions$description,1)
## [1] "Harley is not sure how he wound up at shelter in his senior years but as you see from the pictures the shelter asked if we could find a real home for this active senior boy. You would never know he is 9 years old. Very playful and loves humans and all the dogs he has met at adoptions he seems to like. He is 59 lbs so still pretty strong for a senior but loves walks and you have to love his ears. If you would like to meet this happy go lucky boy please contact AdoptAnimalNetwork@gmail.com. Updated pictures this Sunday."

2.5 Combined data

We will combine dataset dog description and dog travelas a new data set and summarize the information to analyze or compare summary information from dataset dog move.

3. Exploratary Data Analysis

3.1 EDA by visualization

We will start EDA by exploring and visualizing the variables within each data sets.

3.1.1 Dog Moves Dataset EDA

We build interactive plots for the country a dog is in to provide information to the user.

Dog Location

library(crosstalk)
library(d3scatter)
shared_dogmoves <- SharedData$new(dog_moves)
bscols(widths = c(3,NA,NA),
       list(
         filter_checkbox("ctr", "In the US?", shared_dogmoves, ~inUS, inline = TRUE),
         filter_select("st", "Location Name", shared_dogmoves, ~location)
       ),
       d3scatter(shared_dogmoves, ~total, ~exported, ~factor(inUS), width="100%", height=250)
)
## Warning in bscols(widths = c(3, NA, NA), list(filter_checkbox("ctr", "In the
## US?", : Too many widths provided to bscols; truncating

Above we can see a plot comparing the total number of dogs plotted against the total number of exported dogs in our dataset with an interactive filter to view only dogs in the US vs. those outside of the US, as well as a dropdown location selector. We could incorporate a more advanced version of this inside of our shiny app to give insight as to the total number of dogs in a given location, as well as whether they are generally exported or imported.

In US

barplot(table(dog_moves$inUS), ylab = "Count of Dogs", main = "# of Dogs in vs. not in the US")

Exported Dogs

barplot(dog_moves$total ~ dog_moves$location, xlab = "location", ylab = "# of Exported Dogs", main = "# of Exported Dogs by Location", las = 2)

Imported Dogs

barplot(dog_moves$imported ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "# of Imported Dogs by Location", las = 2)

Total Dogs

barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "Total # of Dogs by Location", las = 2)

3.1.2 Dog Travel Dataset EDA

We build word cloud on variabledescription as it’s mainly text and wishing to gain more information from the text mining.

We also intend to gain geographical visualization of travel summarized by contact state contact state

travel<- dog_travel%>%
  group_by(contact_state)%>%
  count()
travel<- is.numeric(travel$n)

#plot_usmap(data = travel, 
#           values = travel$n, 
#           include = c("AK", "AL", "AR", "AZ", "CA", #"CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","IN","KS","KY#","LA","MA","MD","ME","MI","MN","MO","MS","MT","NB","NV","ND","#NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","QC","RI","SC"#,"SD","TN","TX","UT","VA","VT","WA","WI","WV","WY"), 
#           color = "red") + 
 # scale_fill_continuous(
#    low = "white", high = "red", name = "travel contact state", #label = scales::comma) + 
#  labs(title = "US States", subtitle = "Travel group by #information") 

#plot_usmap(data = dog_travel, values = table(contact_state), color = "red") + 
#  scale_fill_continuous(name = "Contact State", 
#                        label = scales::comma) + 
#  theme(legend.position = "right")

Have problem generating the map. Later we could also do similar with found.

description This variable contents only text, so generated a word cloud to see the most frequent word.

# Load the data as a corpus
docs <- Corpus(VectorSource(dog_travel$description))
#inspect the content of docs
inspect(docs)
#text transformation
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, toSpace, "~")
docs <- tm_map(docs, toSpace, "~~")
#text cleaning
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
# Remove numbers
docs <- tm_map(docs, removeNumbers)
## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
## Warning in tm_map.SimpleCorpus(docs, removeWords, c("blabla1", "blabla2")):
## transformation drops documents
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation drops
## documents
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents
# Text stemming
# docs <- tm_map(docs, stemDocument)
#build text- document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 1)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

found

barplot(table(dog_travel$found))

Maybe we can do parallel matching with contact city/ state later, or map by state and do comparision. At present, we can not achieve this as the variable is described asAlabamainstead of AL. The work need to be done is: word matching, if it’s state in U.S., we will match it to contact state, which is two letter form; if it’s city data, we will match it to contact city; if it’s not in the U.S., we will list the country and combine with dog_moves dataset later.

remove

barplot(table(dog_travel$remove))

still_there

barplot(table(dog_travel$still_there))

3.1.3 Dog Descriptions Dataset EDA

We intend to used bar chart/ pie chart on categorical variables eg, breed, color, sex. We also intend to use parallel coordinates on variables foundand contact state/city.

On top of that, we intend to gain new information by slicing and grouping by the state, we can get more summarized information by state and then look into information within each state.

par(mfrow=c(2,2))
barplot(table(dog_descriptions$breed_primary))
barplot(table(dog_descriptions$sex))
barplot(table(dog_descriptions$age))
barplot(table(dog_descriptions$color_primary))

** color primary**

# Increase margin size
par(mar=c(13,4,4,4))
barplot(table(dog_descriptions$color_secondary),las=2)

color secondary

# Increase margin size
par(mar=c(13,4,4,4))
barplot(table(dog_descriptions$color_secondary),las=2)

coat

barplot(table(dog_descriptions$coat))

breed_df<- dog_descriptions%>%
  group_by(contact_state,breed_primary)%>%
  count()%>%
  arrange(contact_state, desc(n))

#table(breed_df$contact_state)

we also intend to visualize different categories by state.

#plot_usmap(
#    data = breed_df, values = n, include = c("AK", "AL", "AR", #"AZ", "CA", "CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","#IN","KS","KY","LA","MA","MD","ME","MI","MN","MO","MS","MT","NB"#,"NV","ND","NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","Q#C","RI","SC","SD","TN","TX","UT","VA","VT","WA","WI","WV","WY")#, color = "red"
#  ) + 
#  scale_fill_continuous(
#    low = "white", high = "red", name = "Breed in", label = #scales::comma
 # ) + 
#  labs(title = "Western US States", subtitle = "These are the #states in the Pacific Timezone.") 

We still need to fix the code on this one.

3.2 Shiny Proposal

From the exploratory data analysis by visualization above. We want to create more explicit and interactive visualization, which could not only help our analysis but more user-friendly.

We want to build an interactive visualization using shiny. We will be working on all three data sets. We will be able to let user to get more generalized information by state. Additionally, when a certain state is selected, more detailed information about dogs in the state could display. For example, the imported/ exported dogs and their names, breeds, origins, age, etc.

{r shiny, echo = FALSE} selectInput(“data”, "“, c(”co2“,”lh"))

See a plot:
``{r echo = FALSE}
renderPlot({
  d <- get(input$data)
plot(d) })

3.4 Logisitic Analysis Proposal

We are trying to do logistic regression analysis on the combined data of dog description and dog moves. We will split the dataset into train and test data set.

What we need right now that are unachievable is effective text mining, as we find some information like adoption fee and tranfer fee are contained within the description variable of dog description dataset. They could play a role in the model as well.

3.5 Frequent Subsequence Mining Proposal

Looking into the word cloud we generated, the most frequent words that have been mentioned are not very useful for our adoption decision evaluation. We could further explore the presence of certain types of words in certain types of dogs to recommend similar records to a user interacting with our Shiny app.

We decide to try with Frequent itemset mining (FPA).

Back to top

Appendix: Dictionary

Variable Class description
location character The full name of the US state or country
exported double The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location
imported double The number of adoptable dogs available in this state that originated in a different location
total double The total number of adoptable dogs availabe in a given state.
inUS logical Whether or not a location is in the US or not. Here, US territories will return FALSE
Variable Class description
id double The unique PetFinder identification number for each animal
contact_city character The rescue/shelter’s listed city
contact_state character The rescue/shelter’s listed State
description character The full description of each animal as entered by the rescue/shelter
found character Where the animal was found.
manual character .
remove logical Animal removed from location
still_there logical TRUE/FALSE - Whether the animal is still located in their origin location and will be transported to their final destination after adoption.
Variable Class description
id double The unique PetFinder identification number for each animal.
org_id character The unique identification number for each shelter or rescue.
url character The URL for each animal’s listing.
species character Species of animal.
breed_primary character The primary (assumed) breed assigned by the shelter or rescue.
breed_secondary character The secondary (assumed) breed assigned by the shelter or rescue.
breed_mixed logical Whether or not an animal is presumed to be mixed breed.
breed_unknown logical Whether or not the animal’s breed is completely unknown.
color_primary character The most prevalent color of an animal.
color_secondary character The second most prevalent color of an animal.
color_tertiary character The third most prevalent color of an animal.
age character The assumed age class of an animal (Baby, Young, Adult, or Senior).
sex character The sex of an animal (Female, Male, or Unknown).
size character The general size class of an animal (Small, Medium, Large, Extra Large).
coat character Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).
fixed logical Whether or not an animal has been spayed/neutered.
house_trained logical Whether or not an animal is trained to not go to the bathroom in the house.
declawed logical Whether or not the animal has had its dewclaws removed.
special_needs logical Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).
shots_current logical Whether or not the animal is up to date on all of their vaccines and other shots.
env_children logical Whether or not the animal is recommended for a home with children.
env_dogs logical Whether or not the animal is recommended for a home with other dogs.
env_cats logical Whether or not the animal is recommended for a home with cats.
name character The animal’s name (as given by the shelter/rescue).
tags character Any tags given to the dog by the shelter rescue (pipe
photo character The URL to the animal’s primary photo.
status character Whether the animal is adoptable or not.
posted character The date that this animal was first listed on PetFinder .
contact_city character The rescue/shelter’s listed city.
contact_state character The rescue/shelter’s listed state.
contact_zip character The rescue/shelter’s listed zip code.
contact_country character The rescue/shelter’s listed country.
stateQ character The state abbreviation queried in the API to return this result .
accessed double The date that this data was acquired from the PetFinder API.
type character The type of animal.
description character The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.