1 Introduction

1.1 Problem Statement

Approximately 3.1 million dogs are sheltered annually in the United States. At these shelters, about 1.2 million dogs are put down annually in the United States. Even the dogs which are not put down are often scared and confused, which makes them less adoptable due to their behavior during visits. However, there may be a way to determine the adoptability of canines, which shelters could potentially use to optimize their adoption process. This would result in more shelter dogs being homed and not being re-sheltered. The first step is to understand what makes specific dogs “adoptable” or not.

The data used for investigating the adaptability of shelter dogs is that of The Pudding’s Amber Thomas and Sacha Maxim. The method that the team plans to use is by cleaning, exploring, and analyzing said data such that correlations and patterns can be illustrated which may explain the trends by which Americans adopt dogs.

Our current proposed approach/analytic technique which we believe will help to address this problem is Frequent Subsequence Mining or Subset Mining. By filtering for certain traits of dogs and observing if these traits possibly contribute to their eventual adoption. For example, do Americans prefer to adopt dogs which can be bred (like those who have not been spayed or neutered), or would a dog be more likely to be adopted if there was no chance for unwanted pregnancy? Does location affect this? Perhaps there are more dog breeders in southern states as opposed to northern states. Which breeds of dogs are more likely to be adopted, and are adopters willing to import or export their pets to get the specific breed that they want? By wrangling this data and exploring these factors, perhaps a clearer understanding of a dog’s adaptability can be gleaned.

This analysis will help shelters enhance their adoption process to target certain American demographics with different desires for the dogs they choose to adopt. This way, adoption can be facilitated because the shelters can either make their dogs more adoptable by declawing, grooming, training, sterilizing, importing, exporting, etc. Improved adoptability of dogs will then theoretically lead to increased rates of adoption, which will decrease overall suffering of sheltered dogs.

1.2 Packages Required

The package which was used in this analysis and is therefore required to replicate it is tidyverse. We chose to use this package because it is comprised of several other packages, including: ggplot2 (for creating graphics), dplyr (for data manipulation), tidyr (to help clean the data), readr (to read rectangular data), tibble (to modify data frames), etc. Our team used the tidyverse package because we wanted to utilize the following commands which are available in the tidyverse package: readr::read_csv, tibble(), mutate(), etc.

We also used the hwardcloud and wordcloud2 packages to generate word clouds. Additionally, package tm was used for text mining, package SnowballC was used for text stemming, and package RColorBrewer was used for access to different color palettes. We use the devtools github package, including the libraries d3scatter and crosstalk to generate interactive plots.

1.3 load package and write functions to avoid iteration

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'tm' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'tm'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\John Trygier\Documents\R\win-library\4.0\00LOCK\tm\libs\x64\tm.dll to C:
## \Users\John Trygier\Documents\R\win-library\4.0\tm\libs\x64\tm.dll: Permission
## denied

## Warning: restored 'tm'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'SnowballC' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'SnowballC'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\SnowballC\libs\x64\SnowballC.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\SnowballC\libs\x64\SnowballC.dll: Permission
## denied

## Warning: restored 'SnowballC'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'wordcloud' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'wordcloud'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\wordcloud\libs\x64\wordcloud.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\wordcloud\libs\x64\wordcloud.dll: Permission
## denied

## Warning: restored 'wordcloud'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'RColorBrewer' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## Warning: package 'd3scatter' is not available (for R version 4.0.2)

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## 
##   There is a binary version available but the source version is later:
##          binary source needs_compilation
## devtools  2.4.2  2.4.3             FALSE

## installing the source package 'devtools'

## Skipping install of 'crosstalk' from a github remote, the SHA1 (8128ef3b) has not changed since last install.
##   Use `force = TRUE` to force installation

## Skipping install of 'd3scatter' from a github remote, the SHA1 (aba6687f) has not changed since last install.
##   Use `force = TRUE` to force installation

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'lubridate' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'lubridate'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\lubridate\libs\x64\lubridate.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\lubridate\libs\x64\lubridate.dll: Permission
## denied

## Warning: restored 'lubridate'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## 
##   There is a binary version available but the source version is later:
##      binary source needs_compilation
## glue  1.5.0  1.5.1              TRUE
## 
##   Binaries will be installed
## package 'glue' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'glue'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\John Trygier\Documents\R\win-library\4.0\00LOCK\glue\libs\x64\glue.dll
## to C:\Users\John Trygier\Documents\R\win-library\4.0\glue\libs\x64\glue.dll:
## Permission denied

## Warning: restored 'glue'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'gridExtra' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'lattice' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'lattice'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\lattice\libs\x64\lattice.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\lattice\libs\x64\lattice.dll: Permission
## denied

## Warning: restored 'lattice'

## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'leaflet' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'formattable' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'geojsonio' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## Warning in install.packages("rgeos", type = "source", repos = "http://cran.us.r-
## project.org"): installation of package 'rgeos' had non-zero exit status

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## Warning in install.packages("rgdal", type = "source", repos = "http://cran.us.r-
## project.org"): installation of package 'rgdal' had non-zero exit status

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'RColorBrewer' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'viridis' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'zipcodeR' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'ggmap' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages

# Load Libraries
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(tibble)
library(broom)
library(ggplot2)
library(gridExtra)
library(grid)
library(lattice)
library(formattable)
library(zipcodeR)
library(ggmap)
library(geojsonio)
library(RColorBrewer)
library(rgdal)
library(rgeos)
library(viridis) 

library(usmap)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(leaflet)

library(crosstalk)
library(d3scatter)
library(lubridate)
library(glue)

Create 2 variables named state_fulnameand state_abrv representing full name and abbreviation of 51 states to avoid duplication.

state_fulname = c('Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia','Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts','Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey','New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island','South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia','Wisconsin', 'Wyoming', 'Washington DC')

state_abrv= c('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME','MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA','RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC')

Create a Function named freq_count to avoid iteration. The function has argument of df(the data frame), group_var(the variable we want to group_by) and n(top n biggest frequency count numbers).

freq_count <- function(df, group_var, n) {
  require(dplyr)
  group_var <- enquo(group_var) # need to quote
  #summary_var <- enquo(summary_var)
  df %>%
    group_by(!!group_var) %>% # !! unquotes
    summarise(freq_sum = n())%>%
    arrange(desc(freq_sum))%>%
    top_n(n=n)
}

2 Data Cleaning

2.1 Data Source and Explore

Data for this project comes from Adoptable Dogs on the R for Data Science Github repository. here

This data was originally used in a project by Amber Thomas and designed by Sacha Maxim. This data was included with their article, Finding Forever Homes and uses this adoptable dogs data from 09-20-2019 from Petfinder.com.

Included are three data sets. Two of which can be joined and at third that is a summary data set.

Note: The original datadog_descriptions.csv have data entry mistakes. Thus, we corrected on the data entries and load the new data from our personal github repository.

Load Data and library

dog_moves <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv')
dog_travel <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_travel.csv')

dogtraveltry<- readr::read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dog-shelters/dogTravel.csv")

#original data:
#dog_descriptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_descriptions.csv')

#new dog_descriptions
dog_descriptions<-readr::read_csv('https://raw.githubusercontent.com/njuteresa2019/MS-BA-data-wrangling/main/dog_descriptions.csv')

Remove Missing Values: Move dataset

summary(dog_moves)

##    location            exported         imported          total       
##  Length:90          Min.   :  1.00   Min.   :  1.00   Min.   :  15.0  
##  Class :character   1st Qu.:  1.00   1st Qu.: 11.25   1st Qu.: 477.5  
##  Mode  :character   Median :  4.00   Median : 24.50   Median : 857.0  
##                     Mean   : 30.37   Mean   : 64.74   Mean   :1139.8  
##                     3rd Qu.: 18.00   3rd Qu.: 73.25   3rd Qu.:1650.0  
##                     Max.   :635.00   Max.   :390.00   Max.   :4002.0  
##                     NA's   :9        NA's   :52       NA's   :39      
##     inUS        
##  Mode :logical  
##  FALSE:39       
##  TRUE :51       
##                 
##                 
##                 
##

x <- dog_moves[is.na(dog_moves$total),]
print(tibble(x), n=10)

## # A tibble: 39 x 5
##    location    exported imported total inUS 
##    <chr>          <dbl>    <dbl> <dbl> <lgl>
##  1 Puerto Rico      131       NA    NA FALSE
##  2 South Korea       76       NA    NA FALSE
##  3 Mexico            54       NA    NA FALSE
##  4 China             28       NA    NA FALSE
##  5 Thailand          20       NA    NA FALSE
##  6 India             18       NA    NA FALSE
##  7 Taiwan            18       NA    NA FALSE
##  8 Egypt             15       NA    NA FALSE
##  9 Bahamas           13       NA    NA FALSE
## 10 Spain              8       NA    NA FALSE
## # ... with 29 more rows

Treatment of missing values: Travel and Description dataset

As per description The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location and vice versa for imported dogs. Total is simply total dogs available. For imported and exported we can replace missing values with 0, although as we don’t want to lose entire rows and possibly it might actually be 0 as well. Apart from Indianapolis rest of the values are foreign countries for total so it does not matter much as we are not focused on total dogs available for adoption outside the US. It will simply be transformed into 0. Logically and for our analysis, this would not have an impact.

dog_moves$total[is.na(dog_moves$total)] <- 0
dog_moves$imported[is.na(dog_moves$imported)] <- 0
dog_moves$exported[is.na(dog_moves$exported)] <- 0

We have missing values for remove and still_there. We might not have this data collected but it does have to exist Animal removed from location is for remove. still_there is Whether the animal is still located in their origin location and will be transported to their final destination after adoption. Ideally interaction with source of data would clarify this. For remove we can assume NA is false which will imply animal is still present at location. Similarly for still there.These movements are for dogs in transit for adoption and anything else we can assume the large chuck of dogs are not moving and hence NA turned into FALSE.

dog_travel$remove[is.na(dog_travel$remove)] <- FALSE
dog_travel$still_there[is.na(dog_travel$still_there)] <- FALSE
summary(dog_travel)

##        id           contact_city       contact_state      description       
##  Min.   : 8619716   Length:6194        Length:6194        Length:6194       
##  1st Qu.:44940096   Class :character   Class :character   Class :character  
##  Median :45734990   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :44786280                                                           
##  3rd Qu.:45923182                                                           
##  Max.   :46043149                                                           
##     found              manual            remove        still_there    
##  Length:6194        Length:6194        Mode :logical   Mode :logical  
##  Class :character   Class :character   FALSE:4456      FALSE:5875     
##  Mode  :character   Mode  :character   TRUE :1738      TRUE :319      
##                                                                       
##                                                                       
##

Declawed, color_tertiary, photo, and tags are removed as nearly all values here are missing. Species is dog for every row so it adds no additional value.

dog_descriptions <- subset(dog_descriptions,select = -c(declawed,species,color_tertiary ,photo ,tags))

Checking for counts of NA’s by attribute

Remaining missing values are all in character variables, the NA values can be replaced by UNKNOWN character string for models which don’t tolerate missing values.

NA counts in dog_description dataset

colSums(is.na(dog_descriptions))

##              id          org_id             url   breed_primary breed_secondary 
##               0               0               0               0           37359 
##     breed_mixed   breed_unknown   color_primary color_secondary             age 
##               0               0           32046           46121               0 
##             sex            size            coat           fixed   house_trained 
##               0               0           30995               0               0 
##   special_needs   shots_current    env_children        env_dogs        env_cats 
##               0               0           30153           23511           38828 
##            name          status          posted    contact_city   contact_state 
##               0               0               0               0               0 
##     contact_zip contact_country          stateQ        accessed            type 
##              12               0               0               0             640 
##     description 
##            8705

NA counts in dog_travel dataset

colSums(is.na(dog_travel))

##            id  contact_city contact_state   description         found 
##             0             0             0             0             0 
##        manual        remove   still_there 
##          4047             0             0

NA counts in dog_moves dataset

colSums(is.na(dog_moves))

## location exported imported    total     inUS 
##        0        0        0        0        0

Evaluation of the data types and values

In this section we evaluate the individual variables. Looking at the datatype and values in the attribute.

Evaluation for:

Is the datatype appropriate?
Are there data that will disrupt the analysis, including outliers?

The Process:

Run summary on the attribute.
Evaluate that data. Running additional reports and plots to inspect the data including but not limited to tidy queries, visually inspecting output and plots.
Run any cleanup or tidying that was appropriate.
Provide visual representation, table or other output to communicate the contents.

Included below all the attributes and the summary views that were chosen to represent the data. Where cleanup or other manipulation was needed, this code is included and explained in more detail.

2.2 Dog Moves Dataset

This dataset was derived from Dog Descriptions.csv and Dog Travel.csv file to find the total numbers of imports and exports for each location. This data represents a single day of data. It was all collected on September 20, 2019. The script to process the file can be found here.

dim(dog_moves)

## [1] 90  5

head(dog_moves)

## # A tibble: 6 x 5
##   location       exported imported total inUS 
##   <chr>             <dbl>    <dbl> <dbl> <lgl>
## 1 Texas               635        0   566 TRUE 
## 2 Alabama             268        2  1428 TRUE 
## 3 North Carolina      158       14  2627 TRUE 
## 4 South Carolina      139       12  1618 TRUE 
## 5 Georgia             137       19  3479 TRUE 
## 6 Puerto Rico         131        0     0 FALSE

summary(dog_moves)

##    location            exported         imported          total       
##  Length:90          Min.   :  0.00   Min.   :  0.00   Min.   :   0.0  
##  Class :character   1st Qu.:  1.00   1st Qu.:  0.00   1st Qu.:   0.0  
##  Mode  :character   Median :  3.00   Median :  0.00   Median :  66.5  
##                     Mean   : 27.33   Mean   : 27.33   Mean   : 645.9  
##                     3rd Qu.: 16.50   3rd Qu.: 15.50   3rd Qu.: 939.5  
##                     Max.   :635.00   Max.   :390.00   Max.   :4002.0  
##     inUS        
##  Mode :logical  
##  FALSE:39       
##  TRUE :51       
##                 
##                 
##

location - character - The full name of the US state or country.

freq_count(dog_moves,location,5)

## # A tibble: 90 x 2
##    location    freq_sum
##    <chr>          <int>
##  1 Afghanistan        1
##  2 Alabama            1
##  3 Alaska             1
##  4 Arizona            1
##  5 Arkansas           1
##  6 Aruba              1
##  7 Azerbaijan         1
##  8 Bahamas            1
##  9 Bahrain            1
## 10 Bosnia             1
## # ... with 80 more rows

There are 90 rows in this dataset. Each row represents a specific US state or country.

exported - double - The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location

imported - double - The number of adoptable dogs available in this state that originated in a different location

total - double - The total number of adoptable dogs available in a given state.

inUS - logical - Whether or not a location is in the US or not. Here, US territories will return FALSE

freq_count(dog_moves,inUS,5)

## # A tibble: 2 x 2
##   inUS  freq_sum
##   <lgl>    <int>
## 1 TRUE        51
## 2 FALSE       39

There are 51 observations that are in the U.S. and 39 from other countries. We can create a new variable country and fill with “U.S.” and other country names.

dog_moves<- dog_moves%>%
  mutate(country= ifelse(inUS=="TRUE","US",location))

2.3 Dog Travel Dataset

This dataset was derived from Dog Descriptions.csv. This file only includes information on dogs whose description indicates that they did not originate in the state where they were made available for adoption. This file aims to show where those dogs are available and where they came from. This data represents a single day of data. It was all collected on September 20, 2019. The script to process the file can be found here.

dim(dog_travel)

## [1] 6194    8

head(dog_travel)

## # A tibble: 6 x 8
##         id contact_city contact_state description found manual remove still_there
##      <dbl> <chr>        <chr>         <chr>       <chr> <chr>  <lgl>  <lgl>      
## 1 44520267 Anoka        MN            "Boris is ~ Arka~ <NA>   FALSE  FALSE      
## 2 44698509 Groveland    FL            "Duke is a~ Abac~ Baham~ FALSE  FALSE      
## 3 45983838 Adamstown    MD            "Zac Woof-~ Adam  Maryl~ FALSE  FALSE      
## 4 44475904 Saint Cloud  MN            "~~Came in~ Adap~ <NA>   TRUE   FALSE      
## 5 43877389 Pueblo       CO            "Palang is~ Afgh~ <NA>   FALSE  FALSE      
## 6 43082511 Manchester   CT            "Brooke ha~ Afgh~ <NA>   FALSE  FALSE

summary(dog_travel)

##        id           contact_city       contact_state      description       
##  Min.   : 8619716   Length:6194        Length:6194        Length:6194       
##  1st Qu.:44940096   Class :character   Class :character   Class :character  
##  Median :45734990   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :44786280                                                           
##  3rd Qu.:45923182                                                           
##  Max.   :46043149                                                           
##     found              manual            remove        still_there    
##  Length:6194        Length:6194        Mode :logical   Mode :logical  
##  Class :character   Class :character   FALSE:4456      FALSE:5875     
##  Mode  :character   Mode  :character   TRUE :1738      TRUE :319      
##                                                                       
##                                                                       
##

id - double - The unique PetFinder identification number for each animal

contact_city - character - The rescue/shelter’s listed city. Cleaning: change zipcode=17325 to it’s city name Gettysburg.

#table(dog_travel$contact_city)
dog_travel%>%
  filter(nchar(contact_city)<3)

## # A tibble: 15 x 8
##          id contact_city contact_state description found manual remove still_there
##       <dbl> <chr>        <chr>         <chr>       <chr> <chr>  <lgl>  <lgl>      
##  1 45362806 OH           OH            "Liberty c~ Cinc~ Ohio   FALSE  FALSE      
##  2 36978896 PA           PA            "Maddie is~ Mary~ <NA>   TRUE   FALSE      
##  3 33218331 PA           PA            "Born in A~ Mary~ <NA>   TRUE   FALSE      
##  4 36978896 PA           PA            "Maddie is~ New ~ <NA>   TRUE   FALSE      
##  5 33218331 PA           PA            "Born in A~ New ~ <NA>   TRUE   FALSE      
##  6 36978896 PA           17325         "Maddie is~ New ~ <NA>   TRUE   FALSE      
##  7 33218331 PA           17325         "Born in A~ New ~ <NA>   TRUE   FALSE      
##  8 36978896 PA           17325         "Maddie is~ Penn~ <NA>   TRUE   FALSE      
##  9 33218331 PA           17325         "Born in A~ Penn~ <NA>   TRUE   FALSE      
## 10 36978896 PA           17325         "Maddie is~ Virg~ <NA>   TRUE   FALSE      
## 11 33218331 PA           17325         "Born in A~ Virg~ <NA>   TRUE   FALSE      
## 12 36978896 PA           17325         "Maddie is~ Wash~ <NA>   TRUE   FALSE      
## 13 33218331 PA           17325         "Born in A~ Wash~ <NA>   TRUE   FALSE      
## 14 36978896 PA           17325         "Maddie is~ West~ <NA>   TRUE   FALSE      
## 15 33218331 PA           17325         "Born in A~ West~ <NA>   TRUE   FALSE

#cleaning
dog_travel<- dog_travel%>%
  mutate(contact_city=ifelse(contact_state=="17325","Gettysburg",contact_city))

contact_state - character - The rescue/shelter’s listed State. Cleaning: change the zipcode to the state code name.

table(dog_travel$contact_state)

## 
## 17325    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA    IA    IL 
##    10    75    10    70    28   103    90   112    57   133   109    22    37 
##    IN    KS    KY    LA    MA    MD    ME    MI    MN    MO    MS    NB    NC 
##    65     3    31     5   107   379   361    67   190    42     6     2   187 
##    NH    NJ    NM    NV    NY    OH    OK    OR    PA    RI    SC    TN    TX 
##    44   552    82    20   490   177    31    32   316   130    21   112     2 
##    UT    VA    VT    WA    WI    WV 
##    66  1025    49   634    83    27

#cleaning
dog_travel$contact_state[dog_travel$contact_state=="17325"]<-'PA'

description - character - The full description of each animal as entered by the rescue/shelter

head(dog_travel$description, 1)

## [1] "Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He loves rope toys and running around his foster mom's yard. \n\nHe is 4 years old, just under 10 pounds, and is a special needs dog. He needs vitamin B-12 shots every other week (may change), and needs prescription dog food with no meet or protein. Boris is on a vegan diet.\n\nIf interested in Boris please fill out or adoption application so we can set up a meet and greet for you! His foster mom can explain more in-depth about his needs and can answer any questions you may have about his lifestyle.\n\nBoris' adoption fee is $300 and the application can be found at http://www.aohrescue.org/dog-adoption-application"

found - character - Where the animal was found. Note: this is a mixed bag of values States/countries/cities

table(dog_travel$found)

The entries looks abnormal:

dog_travel%>%
  filter(found%in%c("Beijing","Ark","Ark.","Buddy","Chihuahua"))

## # A tibble: 11 x 8
##          id contact_city contact_state description found manual remove still_there
##       <dbl> <chr>        <chr>         <chr>       <chr> <chr>  <lgl>  <lgl>      
##  1 44182772 Crescent Ci~ FL            "Go to www~ Ark   <NA>   TRUE   FALSE      
##  2 45773500 Raleigh      NC            "Are you l~ Ark.  <NA>   TRUE   FALSE      
##  3 45463453 Sebec        ME            "Please co~ Beij~ China  FALSE  FALSE      
##  4 44759410 Sebec        ME            "Please co~ Beij~ China  FALSE  FALSE      
##  5 44759409 Sebec        ME            "Please co~ Beij~ China  FALSE  FALSE      
##  6 45319754 Southampton  NY            "You can f~ Buddy <NA>   TRUE   FALSE      
##  7 44828661 Manchester   NH            "*This dog~ Buddy <NA>   TRUE   FALSE      
##  8 45672267 Litchfield ~ AZ            "ANGEL (Pu~ Chih~ Arizo~ FALSE  FALSE      
##  9 46028329 Egg Harbor ~ NJ            "BENNY & L~ Chih~ <NA>   TRUE   FALSE      
## 10 45985438 Egg Harbor ~ NJ            "BENNY & L~ Chih~ <NA>   TRUE   FALSE      
## 11 45404474 Chester Spr~ PA            "Warren th~ Chih~ <NA>   TRUE   FALSE

horrible<- dog_travel%>%filter(found=="HORRIBLE")

After looking into the data, we do not need to clean them. They are names of places instead of wrong data entries. Buddy City is a Day Care, Boarding, Grooming and Training facility for dogs located in the city center of Shanghai. Chihuahua is the largest state in Mexico.

manual - character - The data was manually reviewed and cleaned to remove any original locations that contained only a vague location (e.g., the south, the Carolinas) in favor of those explicitly described.

dog_travel<- dog_travel%>%
  mutate(found_new=ifelse(is.na(manual),found,manual))

Therefore, we create a new variablefound_new to combine found and manual.

remove - logical - Animal removed from location

still_there - logical - TRUE/FALSE - Whether the animal is still located in their origin location and will be transported to their final destination after adoption.

freq_count(dog_travel,still_there,5)

## # A tibble: 2 x 2
##   still_there freq_sum
##   <lgl>          <int>
## 1 FALSE           5875
## 2 TRUE             319

Only 319 dogs are still there (TRUE).

2.4 dog_descriptions dataset

There are 58,180 rows in this dataset. Each row represents an individual adoptable dog in the US on September 20, 2019. Each dog has a unique ID number. Unless otherwise noted, all of the data is exactly is reported by the shelter or rescue that posted an individual animal for adoption on PetFinder.

Dataset dimensions

dim(dog_descriptions)

## [1] 58180    31

head(dog_descriptions)

## # A tibble: 6 x 31
##         id org_id url    breed_primary breed_secondary breed_mixed breed_unknown
##      <dbl> <chr>  <chr>  <chr>         <chr>           <lgl>       <lgl>        
## 1 46042150 NV163  https~ American Sta~ Mixed Breed     TRUE        FALSE        
## 2 46042002 NV163  https~ Pit Bull Ter~ Mixed Breed     TRUE        FALSE        
## 3 46040898 NV99   https~ Shepherd      <NA>            FALSE       FALSE        
## 4 46039877 NV202  https~ German Sheph~ <NA>            FALSE       FALSE        
## 5 46039306 NV184  https~ Dachshund     <NA>            FALSE       FALSE        
## 6 46039304 NV184  https~ Boxer         Beagle          TRUE        FALSE        
## # ... with 24 more variables: color_primary <chr>, color_secondary <chr>,
## #   age <chr>, sex <chr>, size <chr>, coat <chr>, fixed <lgl>,
## #   house_trained <lgl>, special_needs <lgl>, shots_current <lgl>,
## #   env_children <lgl>, env_dogs <lgl>, env_cats <lgl>, name <chr>,
## #   status <chr>, posted <dttm>, contact_city <chr>, contact_state <chr>,
## #   contact_zip <chr>, contact_country <chr>, stateQ <chr>, accessed <date>,
## #   type <chr>, description <chr>

summary(dog_descriptions)

##        id              org_id              url            breed_primary     
##  Min.   :  604115   Length:58180       Length:58180       Length:58180      
##  1st Qu.:44516882   Class :character   Class :character   Class :character  
##  Median :45596624   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :44251485                                                           
##  3rd Qu.:45916674                                                           
##  Max.   :46043149                                                           
##  breed_secondary    breed_mixed     breed_unknown   color_primary     
##  Length:58180       Mode :logical   Mode :logical   Length:58180      
##  Class :character   FALSE:16589     FALSE:58180     Class :character  
##  Mode  :character   TRUE :41591                     Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##  color_secondary        age                sex                size          
##  Length:58180       Length:58180       Length:58180       Length:58180      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      coat             fixed         house_trained   special_needs  
##  Length:58180       Mode :logical   Mode :logical   Mode :logical  
##  Class :character   FALSE:11559     FALSE:37624     FALSE:56034    
##  Mode  :character   TRUE :46621     TRUE :20556     TRUE :2146     
##                                                                    
##                                                                    
##                                                                    
##  shots_current   env_children     env_dogs        env_cats      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:16112     FALSE:4439      FALSE:3547      FALSE:6810     
##  TRUE :42068     TRUE :23588     TRUE :31122     TRUE :12542    
##                  NA's :30153     NA's :23511     NA's :38828    
##                                                                 
##                                                                 
##      name              status              posted                   
##  Length:58180       Length:58180       Min.   :2003-05-18 00:00:00  
##  Class :character   Class :character   1st Qu.:2019-04-22 07:02:14  
##  Mode  :character   Mode  :character   Median :2019-08-13 15:40:24  
##                                        Mean   :2019-03-09 14:18:36  
##                                        3rd Qu.:2019-09-09 17:25:28  
##                                        Max.   :2019-09-20 17:32:30  
##  contact_city       contact_state      contact_zip        contact_country   
##  Length:58180       Length:58180       Length:58180       Length:58180      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     stateQ             accessed              type           description       
##  Length:58180       Min.   :2019-09-20   Length:58180       Length:58180      
##  Class :character   1st Qu.:2019-09-20   Class :character   Class :character  
##  Mode  :character   Median :2019-09-20   Mode  :character   Mode  :character  
##                     Mean   :2019-09-20                                        
##                     3rd Qu.:2019-09-20                                        
##                     Max.   :2019-09-20

id - double - The unique PetFinder identification number for each animal.

quantile(dog_descriptions$id)

##       0%      25%      50%      75%     100% 
##   604115 44516882 45596624 45916674 46043149

org_id - character - The unique identification number for each shelter or rescue.

freq_count(dog_descriptions,org_id, 5)

## Selecting by freq_sum

## # A tibble: 6 x 2
##   org_id freq_sum
##   <chr>     <int>
## 1 GA423       473
## 2 GA217       418
## 3 NY1274      311
## 4 AZ414       263
## 5 AZ101       241
## 6 NJ376       241

These are top 5 shelter/ rescue organizations and GA423 ranked the first with 473 rescue records.

url - character - The URL for each animal’s listing.

head(dog_descriptions$url, 5)

## [1] "https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"              
## [2] "https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"              
## [3] "https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [4] "https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"     
## [5] "https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"

species - character - Species of animal. REMOVED in prior step (all rows are Dogs)

breed_primary - character - The primary (assumed) breed assigned by the shelter or rescue.

head(table(dog_descriptions$breed_primary))

## 
##    Affenpinscher     Afghan Hound Airedale Terrier           Akbash 
##               17                4               19                3 
##            Akita Alaskan Malamute 
##              181               72

x<- freq_count(dog_descriptions,breed_primary,200)

## Selecting by freq_sum

Most frequent breed within the data is Pit Bull Terrier.

breed_secondary - character - The secondary (assumed) breed assigned by the shelter or rescue.

head(table(dog_descriptions$breed_secondary))

## 
##    Affenpinscher     Afghan Hound Airedale Terrier           Akbash 
##               18                1                6               10 
##            Akita Alaskan Malamute 
##               28               13

breed_mixed - logical - Whether or not an animal is presumed to be mixed breed.

summary(dog_descriptions$breed_mixed)

##    Mode   FALSE    TRUE 
## logical   16589   41591

breed_unknown - logical - Whether or not the animal’s breed is completely unknown. REMOVED - These were all FALSE so attribute is removed

table(dog_descriptions$breed_unknown)

## 
## FALSE 
## 58180

dog_descriptions <- 
  dog_descriptions %>% 
  mutate(breed_unknown = NULL)

color_primary - character - The most prevalent color of an animal.

table(dog_descriptions$color_primary)

## 
##                  Apricot / Beige                          Bicolor 
##                             1322                             1415 
##                            Black                          Brindle 
##                             7639                             1953 
##                Brown / Chocolate                           Golden 
##                             2434                              817 
##             Gray / Blue / Silver                        Harlequin 
##                             1174                               73 
##                     Merle (Blue)                      Merle (Red) 
##                              186                              142 
##          Red / Chestnut / Orange                            Sable 
##                             1652                              159 
## Tricolor (Brown, Black, & White)                    White / Cream 
##                             2094                             3134 
##      Yellow / Tan / Blond / Fawn 
##                             1940

freq_count(dog_descriptions,color_primary,5)

## Selecting by freq_sum

## # A tibble: 5 x 2
##   color_primary                    freq_sum
##   <chr>                               <int>
## 1 <NA>                                32046
## 2 Black                                7639
## 3 White / Cream                        3134
## 4 Brown / Chocolate                    2434
## 5 Tricolor (Brown, Black, & White)     2094

Nearly half of the color information are missing.

color_secondary - character - The second most prevalent color of an animal.

table(dog_descriptions$color_secondary)

## 
##                  Apricot / Beige                          Bicolor 
##                               45                               55 
##                            Black                          Brindle 
##                              497                              269 
##                Brown / Chocolate                           Golden 
##                              801                              184 
##             Gray / Blue / Silver                        Harlequin 
##                              296                               37 
##                     Merle (Blue)                      Merle (Red) 
##                               33                               29 
##          Red / Chestnut / Orange                            Sable 
##                              339                               34 
## Tricolor (Brown, Black, & White)                    White / Cream 
##                              124                             7285 
##      Yellow / Tan / Blond / Fawn 
##                             2031

color_tertiary - character - The third most prevalent color of an animal. REMOVED in a prior step

age - character - The assumed age class of an animal (Baby, Young, Adult, or Senior).

table(dog_descriptions$age)

## 
##  Adult   Baby Senior  Young 
##  27955   9397   4634  16194

sex - character - The sex of an animal (Female, Male, or Unknown).

table(dog_descriptions$sex)

## 
##  Female    Male Unknown 
##   27883   30294       3

size - character - The general size class of an animal (Small, Medium, Large, Extra Large).

table(dog_descriptions$size)

## 
## Extra Large       Large      Medium       Small 
##         931       15761       29908       11580

coat - character - Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).

table(dog_descriptions$coat)

## 
##    Curly Hairless     Long   Medium    Short     Wire 
##      135       16     1185     4927    20671      251

fixed - logical - Whether or not an animal has been spayed/neutered.

table(dog_descriptions$fixed)

## 
## FALSE  TRUE 
## 11559 46621

house_trained - logical - Whether or not an animal is trained to not go to the bathroom in the house.

table(dog_descriptions$house_trained)

## 
## FALSE  TRUE 
## 37624 20556

declawed - logical - Whether or not the animal has had its dewclaws removed. REMOVED in a prior step all were FALSE

special_needs - logical - Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).

table(dog_descriptions$special_needs)

## 
## FALSE  TRUE 
## 56034  2146

shots_current - logical - Whether or not the animal is up to date on all of their vaccines and other shots.

table(dog_descriptions$shots_current)

## 
## FALSE  TRUE 
## 16112 42068

env_children - logical - Whether or not the animal is recommended for a home with children.

table(dog_descriptions$env_children)

## 
## FALSE  TRUE 
##  4439 23588

env_dogs - logical - Whether or not the animal is recommended for a home with other dogs.

table(dog_descriptions$env_dogs)

## 
## FALSE  TRUE 
##  3547 31122

env_cats - logical - Whether or not the animal is recommended for a home with cats.

table(dog_descriptions$env_cats)

## 
## FALSE  TRUE 
##  6810 12542

name - character - The animal’s name (as given by the shelter/rescue). This data was cleaned up a bit. There were Identification number and other extra information that for the analysis that we are looking to perform will be distracting and confusing to the end users. It might be necessary to do additional cleanup as we proceed. We are starting by getting the numbers that would be meaningless to the end user cleaned up.

The process is using regular expressions to update the values with just an id number to “Unknown”. Note: includes some notes Like “sponsored” “courtesy post” Id numbers of some kind

head(dog_descriptions$name, 20)

##  [1] "HARLEY"      "BIGGIE"      "Ziggy"       "Gypsy"       "Theo"       
##  [6] "Oliver"      "Macadamia"   "Dodger"      "Huckleberry" "Fagin"      
## [11] "Speckles"    "Cashew"      "Dash"        "Sydney"      "HENNA"      
## [16] "RUBO"        "LEGO"        "MARIE"       "RINGO"       "TONY"

dim(dog_descriptions)

## [1] 58180    30

dog_descriptions <- dog_descriptions %>% 
  mutate(name = str_to_title(name))

dog_descriptions <- dog_descriptions %>% 
  mutate(name = ifelse( str_detect(name, "^[0-9]*$"), "Unknown", name) ) %>% 
  mutate(name = ifelse( !str_detect(name, "[A-Z][A-z]*"), "Unknown", name) ) %>% 
  mutate(name = ifelse( str_detect(name, "^A[0-9]"), "Unknown", name) )

tags - character - Any tags given to the dog by the shelter rescue (pipe | separated). REMOVED in a prior step

photo - character - The URL to the animal’s primary photo. REMOVED in a prior step

status - character - Whether the animal is adoptable or not.

summary(dog_descriptions$status)

##    Length     Class      Mode 
##     58180 character character

posted - character - The date that this animal was first listed on PetFinder .

This single data-time attribute has been separated into separate attributes for each of the periods, posted_year, posted_month, …

dog_descriptions <-
  dog_descriptions %>% 
  separate(posted, c('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec'), sep = "([- :])")


dog_descriptions %>% 
  select('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec') %>% 
  head()

## # A tibble: 6 x 6
##   posted_year posted_month posted_day posted_hour posted_min posted_sec
##   <chr>       <chr>        <chr>      <chr>       <chr>      <chr>     
## 1 2019        09           20         16          37         59        
## 2 2019        09           20         16          24         57        
## 3 2019        09           20         14          10         11        
## 4 2019        09           20         10          08         22        
## 5 2019        09           20         06          48         30        
## 6 2019        09           20         06          43         59

contact_city - character - The rescue/shelter’s listed city.

Clean up: This field had different issues with cases. This effecting the way that the cities were grouped together.

The process used was a mapping tribble and then joining this to the original data and updating. Another way would have been to change the attribute to Title case throughout. This would have taken less code but would not ensure that all the modifications would have been approprtate as the menthod allows. For example Washington DC would be converted to Washington Dc if title case would have been used.

map <- tribble(
  ~contact_city, ~adj_city,
  "abingdon", "Abingdon",
  "lake charles", "Lake Charles",
  "JERSEY CITY", "Jersey City",
  "aitkin", "Aitkin",
  "cary", "Cary",
  "carson", "Carson",
  "BROOKLYN", "Brooklyn",
  "denver", "Denver",
  "grandville", "Grandville",
  "GREER","Greer",
  "BERLIN", "Berlin",
  "YERINGTON", "Yerington",
  "SACRAMENTO","Sacramento",
  "LINDSAY","Lindsay",
  "MONUMENT","Monument",
  "WHEAT RIDGE","Wheat Ridge",
  "NASSAU","Nassau",                
  "NEWARK","Newark",
  "MARION","Marion",
  "TROUTMAN","Troutman",
  "JIM THORPE","Jim Thorpe",
  "POCONO SUMMIT","Pocono Summit",
  "MARYSVILLE","Marysville",
  "reno", "Reno",
  "mount vernon","Mount Vernon",
  "modesto","Modesto",
  "ridgefield","Ridgefield",
  "taunton","Taunton",
  "garden city park","Garden City Park",
  "churchton","Churchton",
  "huntingtown","Huntingtown",
  "fayetteville","Fayetteville",
  "canton","Canton",
  "seattle","Seattle"
)

dog_descriptions <- 
  dog_descriptions %>% 
  left_join(map) %>% 
  mutate(contact_city = if_else(is.na(adj_city), contact_city, adj_city))

## Joining, by = "contact_city"

dog_descriptions %>% distinct(contact_city) %>% head()

## # A tibble: 6 x 1
##   contact_city 
##   <chr>        
## 1 Las Vegas    
## 2 Mesquite     
## 3 Pahrump      
## 4 Henderson    
## 5 Bullhead City
## 6 Kingman

Taking a close look of the original data, it’s data entry problem. Value or NA was not entered for wrong columns, thus columnsstatus,contact_city,contact_state,contact_contry,contact_zip, stateQ,accessed were shifted (mismatching column names). we cleaned them in the original data. For example, In the original data id=“31426754”: There is one record have data entry that had put state in city cell (the data was shifted). Therefore, we used and loaded new data instead.

table(dog_descriptions$contact_city)

#check if the data is corrected
#a<- dog_descriptions%>%
#  filter(id=="31426754")

contact_state - character - The rescue/shelter’s listed state. Noted that in the original data there are numeric values in the contact_state attribute. There’s 32 rows of entries that is zipcode instead of state abbreviation code. Check new data if cleaned:

## 
##   AK   AL   AR   AZ   CA   CO   CT   DC   DE   FL   GA   HI   IA   ID   IL   IN 
##   15 1428  695 2249 1664 1773 1422  336  296 2659 3479   69  485   49 1115 1880 
##   KS   KY   LA   MA   MD   ME   MI   MN   MO   MS   MT   NB   NC   ND   NE   NH 
##  470 1123  913  946 1494  545  673  958  920  510   18    2 2627   64  120  335 
##   NJ   NM   NV   NY   OH   OK   OR   PA   QC   RI   SC   SD   TN   TX   UT   VA 
## 3022  637  858 4007 2673 1636   91 2825   14  607 1618   24 1771  566  485 3061 
##   VT   WA   WI   WV   WY 
##  510 1284  542  565   52

contact_zip - character - The rescue/shelter’s listed zip code.

summary(dog_descriptions$contact_zip)

##    Length     Class      Mode 
##     58180 character character

contact_country - character - The rescue/shelter’s listed country. Noted that there are a few countries that would seem to have state codes entered. These seem to be related to the similar issue seen in the state attribute above.

table(dog_descriptions$contact_country)

## 
##    CA    US 
##    16 58164

stateQ - character - The state abbreviation queried in the API to return this result .

table(dog_descriptions$stateQ)

## 
## 89009 89014 89024 89027 89121 89406 89408 89423 89431 89451 89801    AK    AL 
##   747     1   284     2    25   181     8   435    37   134     5    15  1043 
##    AR    AZ    CA    CO    CT    DC    DE    FL    GA    HI    IA    ID    IL 
##   695  2170  1028  1773  6730  4669  2367  2659  3439    69   485    49  1119 
##    IN    KS    KY    LA    MA    MD    ME    MI    MN    MO    MS    MT    NC 
##  1920   470  1673   913   909    36   134   673   957   920   510    18  3085 
##    ND    NE    NH    NJ    NM    NY    OH    OK    OR    PA    RI    SC    SD 
##    64   120   382  1863   637   669  1901  1636    91   337     1  1137    24 
##    TN    TX    UT    VA    VT    WA    WI    WV    WY 
##  2325   566   199  1167   118  1284   543   677    52

accessed - double - The date that this data was acquired from the PetFinder API.

This date was separated into is date components.

dog_descriptions <-
  dog_descriptions %>% separate(accessed, c('accessed_year', 'accessed_month', 'accessed_day' ), sep = '-') 
  
  
dog_descriptions %>% 
  select(accessed_year, accessed_month, accessed_day) %>% 
  head()

## # A tibble: 6 x 3
##   accessed_year accessed_month accessed_day
##   <chr>         <chr>          <chr>       
## 1 2019          09             20          
## 2 2019          09             20          
## 3 2019          09             20          
## 4 2019          09             20          
## 5 2019          09             20          
## 6 2019          09             20

type - character - The type of animal.

table(dog_descriptions$type)

## 
##   Dog 
## 57540

description - character - The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.

head(dog_descriptions$description,1)

## [1] "Harley is not sure how he wound up at shelter in his senior years but as you see from the pictures the shelter asked if we could find a real home for this active senior boy. You would never know he is 9 years old. Very playful and loves humans and all the dogs he has met at adoptions he seems to like. He is 59 lbs so still pretty strong for a senior but loves walks and you have to love his ears. If you would like to meet this happy go lucky boy please contact AdoptAnimalNetwork@gmail.com. Updated pictures this Sunday."

We intend to use information of adoption fee and transport fee and age in the $description column,only a small percentage of age and fees information can be extracted from the description columns.

dog_descriptions %>% 
  select(description) %>% 
  mutate(age_extract = str_detect(description, "years old"),
         fee_extract=str_detect(description,"fee$"))%>%
  summarize(pct_age = mean(age_extract, na.rm = TRUE),
            pct_fee = mean(fee_extract, na.rm = TRUE))

## # A tibble: 1 x 2
##   pct_age  pct_fee
##     <dbl>    <dbl>
## 1   0.154 0.000141

2.5 Combined data

2.5.1 Create Travel Detail data set

A new data set travel_detail is created containing information about specific dogs travels where it is available. This new table is obtained by linking the dog_descriptions data set with the dog_travel data set using the provided id variable. The new data set provides a easy way to retrieve additional information about individual dogs travels where this information is provided.

# join the dog_descriptions and the dog_travel using the id that was provided.
travel_detail <- dog_descriptions %>% 
  left_join(dog_travel, by="id")

2.5.2 Create State Summary <–> Detail data set

To provide the ability to easily drill down to the detail about a given state based on the summary data the dog_moves is combined with the dog_description based on the state.We create a state_codes data for join purpose.

## join the summary table dog_moves with the detail (descriptions)

# Create a look up table to translate from the state name to the state code.
state_codes <- tibble(state = state_fulname,
                      abbrev = state_abrv)

# join the dog_moves to the dog_descriptions by state/state code through state_codes, the look up table
summary_detail <- dog_moves %>% 
  inner_join(state_codes, by = c("location" = "state")) %>% 
  inner_join(dog_descriptions, by = c("abbrev" = 'contact_state'))

2.5.3 Check State Summary <–> Detail data set

Check to make sure that the summary total in the dog_moves data set matches the total number of rows from the details data set (dog_description). This “dim()” call should return 0 x number of columns.

# make sure that the totals still add up
summary_detail %>% 
  group_by(location) %>% 
  mutate(tot = max(total), count = n()) %>% 
  filter(tot != count) %>% 
  dim()

## [1] 24767    46

3 Exploratary Data Analysis

3.1 EDA by visualization

We will start EDA by exploring and visualizing the variables within each data sets.

3.1.1 Dog Moves Dataset EDA

We build interactive plots for the country a dog is in to provide information to the user.

Dog Location

shared_dogmoves <- SharedData$new(dog_moves)
bscols(widths = c(3,NA,NA),
       list(
         filter_checkbox("ctr", "In the US?", shared_dogmoves, ~inUS, inline = TRUE),
         filter_select("st", "Location Name", shared_dogmoves, ~location)
       ),
       d3scatter(shared_dogmoves, ~total, ~exported, ~factor(inUS), width="100%", height=250)
)

In the US?

FALSE TRUE

Location Name

Above we can see a plot comparing the total number of dogs plotted against the total number of exported dogs in our dataset with an interactive filter to view only dogs in the US vs. those outside of the US, as well as a dropdown location selector. We could incorporate a more advanced version of this inside of our shiny app to give insight as to the total number of dogs in a given location, as well as whether they are generally exported or imported.

barplot

par(mfrow=c(2,2))

#inUS
barplot(table(dog_moves$inUS), ylab = "Count of Dogs", main = "$inUS:Dogs in vs. not in the US")

#Exported dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Exported Dogs", main = "# of Exported Dogs by Location", las = 2)

#imported dogs
barplot(dog_moves$imported ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "# of Imported Dogs by Location", las = 2)

#total dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "Total # of Dogs by Location", las = 2)

top 5 exported countries:

sum<- dog_moves%>%
  group_by(country)%>%
  summarise(sum_export= sum(exported),
            sum_import= sum(imported))%>%
  filter(country!="US")%>%
  arrange(desc(sum_export))

head(sum)

## # A tibble: 6 x 3
##   country     sum_export sum_import
##   <chr>            <dbl>      <dbl>
## 1 Puerto Rico        131          0
## 2 South Korea         76          0
## 3 Mexico              54          0
## 4 China               28          0
## 5 Thailand            20          0
## 6 India               18          0

3.1.2 Dog Travel Dataset EDA

contact state

We intend to gain geographical visualization of travel summarized by contact state. We could also do similar with found.

freq_count(dog_travel,contact_state,5)

## # A tibble: 5 x 2
##   contact_state freq_sum
##   <chr>            <int>
## 1 VA                1025
## 2 WA                 634
## 3 NJ                 552
## 4 NY                 490
## 5 MD                 379

State Virginia state has the largest number of rescue/shelter.

found_new

dog_travel%>%
  filter(!duplicated(id))%>%
  freq_count(found_new,5)

## # A tibble: 5 x 2
##   found_new      freq_sum
##   <chr>             <int>
## 1 Texas               633
## 2 Alabama             325
## 3 Georgia             196
## 4 North Carolina      189
## 5 Virginia            153

State Texas has the biggest output: 633 dogs were found/rescued in Texas.

#contact_state
travel <- dog_travel %>%
  filter(contact_state%in%state_abrv)%>%
  freq_count(contact_state,51)%>%
  mutate(state=contact_state)%>%
  data.frame()

travel$freq_sum <- as.numeric(travel$freq_sum)

# found: join the dog_travel with state_codes
found <- dog_travel %>%
  left_join(state_codes, by = c("found_new" = "state"))%>%
  mutate(state_new=abbrev)%>%
  freq_count(state_new,51)%>%
  na.omit()%>%
  mutate(state=state_new)%>%
  data.frame()

found$freq_sum <- as.numeric(found$freq_sum)

#map- contact state

plot_usmap(data = travel, 
           values = 'freq_sum', 
          include=travel$state, 
          color = "red",
          labels = TRUE) + 
 scale_fill_continuous(
   low = "white", high = "red", name = "# of dogs",label = scales::comma) + 
  labs(title = "US States", subtitle = "$contact_state: Number of dogs group by state")

#map- found
plot_usmap(data = found, 
           values = 'freq_sum', 
          include=found$state, 
          color = "blue",
          labels = TRUE) + 
 scale_fill_continuous(
   low = "white", high = "blue", name = "# of dogs",label = scales::comma) + 
  labs(title = "US States", subtitle = "$found: Number of dogs group by state")

description

fm <- dog_descriptions[-c(2:19,21:27,32:38)]
fm.zip <- geocode_zip(fm$contact_zip)
names(fm)[5] <- "zipcode"
names(fm)[5]

## [1] "zipcode"

fm<- merge(fm.zip, fm, by='zipcode')



us <- c(left = -125, bottom = 25.75, right = -67, top = 49)
get_stamenmap(us, zoom = 5, maptype = "toner-lite") %>% ggmap()

state <- subset(fm, fm$contact_state == "OH")
qmplot(lng, lat, data = fm, maptype = "toner-lite", color = I("blue"))

qmplot(lng, lat, data = state, maptype = "toner-lite", color = I("blue"))

We built a word cloud on variabledescription as it’s mainly text and wishing to gain more information from the text mining and the most frequent word.

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

remove and still_there

#remove
remove_found <- dog_travel %>%
  left_join(state_codes, by = c("found" = "state"))%>%
  mutate(state=abbrev)%>%
  na.omit()

#remove
plota<- ggplot(remove_found,aes(state, fill=factor(remove)))+
  geom_bar()+
  scale_y_continuous(name = "Count", labels =scales::comma)+
  coord_flip()+
  ggtitle("Remove by states")

#still_there
plotb<- ggplot(remove_found,aes(state, fill=factor(still_there)))+
  geom_bar()+
  scale_y_continuous(name = "Count", labels =scales::comma)+
  coord_flip()+
  ggtitle("Still_there by states")

grid.arrange(plota, plotb, ncol=2, nrow =1)

3.1.3 Dog Descriptions Dataset EDA

We intend to used bar chart/ pie chart on categorical variables eg, breed, color, sex,color,coat. We also intend to use parallel coordinates on variables foundand contact state/city.

On top of that, we intend to gain new information by slicing and grouping by the state, we can get more summarized information by state and then look into information within each state.

For not mixed breed, these are the breed with most counted number within each state:

#for not mixed breed most number of breed in each state
breed<- dog_descriptions%>%
  filter(breed_mixed=="FALSE")%>%
  group_by(contact_state,breed_primary)%>%
  summarise(breed_sum=n())%>%
  arrange(contact_state, desc(breed_sum))%>%
  slice(1)

table(breed$breed_primary)

## 
## American Staffordshire Terrier            Australian Shepherd 
##                              1                              2 
##                          Boxer                      Chihuahua 
##                              1                              5 
##                      Coonhound            German Shepherd Dog 
##                              1                              3 
##             Labrador Retriever                    Mixed Breed 
##                              5                              8 
##               Pit Bull Terrier 
##                             25

We can see that as not mixed breed, Pit Bull Terrier as top 1 breed in 25 states.

For mixed breed, these are the breed_primary with most counted number within each state:

#for  mixed breed most number of breed in each state
mixed<-dog_descriptions%>%
  filter(breed_mixed=="TRUE")%>%
  group_by(contact_state,breed_primary)%>%
  summarise(breed_mix_sum=n())%>%
  arrange(contact_state, desc(breed_mix_sum))%>%
  slice(1)

table(mixed$breed_primary)

## 
## Australian Shepherd       Border Collie           Chihuahua    Golden Retriever 
##                   1                   1                   1                   1 
##               Husky  Labrador Retriever         Mixed Breed    Pit Bull Terrier 
##                   1                  29                   4                  13 
##            Shar-Pei            Shepherd 
##                   1                   1

We can see that for Labrador Retriever as breed_primary has the most number: 29 states, followed by Pit Bull Terrier (13 states).

3.2 Combined Dataset EDA

We will continue with EDA by exploring and visualizing the variables between each data sets.

3.2.1 relationship among three dataset:

The original dog_move dataset is derived from cleaned 2,460 rows of Travel dataset. Each row represents a single dog that was available for adoption somewhere in the US. Each of these dogs is described as having been moved from another location to their current location.

sum(dog_moves$exported)

## [1] 2460

sum(dog_moves$imported)

## [1] 2460

moves_sub<- dog_moves%>%select("location","country")

subset of dog_travel in us with distinct id (one row of record).

c= moves_sub$location

travel_moves<- dog_travel%>%
  left_join(moves_sub, by=c("found_new"="location"))%>%
  distinct(id, .keep_all = TRUE)%>%
  filter(country== "US")%>%
  group_by(found_new)%>%
  summarise(n=n())

travel_contact<- dog_travel%>%
  left_join(moves_sub, by=c("found_new"="location"))%>%
  distinct(id, .keep_all = TRUE)%>%
  filter(country== "US")%>%
  group_by(contact_state)%>%
  summarise(n=n())

#exported = travel_moves - travel_contact

# #most_exported <- dog_travel%>%
#   distinct(id, .keep_all = TRUE)%>%
#   count(found_new, sort = TRUE)
# 
# #most_imported <- dog_travel %>%
#   distinct(id, .keep_all = TRUE)%>%
#   count(contact_state, sort = TRUE) 

#subset of dog_travel in us and distinct id
travel_sub_1<-dog_travel%>%
  filter(found_new%in%state_fulname)%>%
  group_by(id)%>%
  distinct(id, .keep_all = TRUE)%>%
  arrange(id)

we can notice that: the sum count of each found(excluding the number from their states, i.e contact_state=found) is approximately the summary number in dog_moves.dataset.

Note: we could not find the source of which 2460 rows of the original travel dataset were selected from the description dataset. Therefore, we computed the sub_total of each location using the dog_travel.csv dataset.

3.2.2 Combine data: travel left_join description

Connect travel and description data, create a new variable called origin_bi: a binary variable, assign 1 if the id is in travel datasets (meaning these dogs whose description indicates that they did not originate in the state where they were made available for adoption ).

travel_id= travel_sub_1$id

dog_descriptions<- dog_descriptions%>%
  mutate(origin_bi=ifelse(id%in%travel_id,1,0))

 # combine_sum%>%
 #   filter(origin_bi==1)%>%
 #   arrange(desc(n))

combine_sum<- dog_descriptions%>%
  group_by(contact_state, origin_bi)%>%
  summarise(n=n())%>%
  mutate(freq = formattable::percent(n/sum(n)))%>% 
  arrange(desc(freq))%>%
  mutate(origin_bi_detail=ifelse(origin_bi==1, "imported", "exported"))

## `summarise()` has grouped output by 'contact_state'. You can override using the `.groups` argument.

ggplot(data=combine_sum, aes(x=contact_state,y=freq, fill=factor(origin_bi_detail)))+
  geom_bar(position="stack", stat="identity")+
  scale_y_continuous(name = "Percentage", labels =scales::percent)+
  coord_flip()+
  ggtitle("Import dog percentage by states")

Washington (WA): 19.27% – has the highest import percentage of dogs. NY has the highest imported number of dogs: 351 (around 8.76%).

3.2.3 Combine data: description left_join travel

We left joined description with travel and got a new dataset named: travel_detatil. Let’s explore an example in this dataset: There is a dog called Dahui, his id number is “44759409”. We can track his origin (China) and exported place (Maine).

travel_detail%>%
  filter(id=="44759409")%>%
  select(id, org_id,url, breed_primary, breed_secondary, age, sex, size,name, contact_city.x,found, found_new, description.x)

## # A tibble: 11 x 13
##          id org_id url     breed_primary breed_secondary age   sex   size  name 
##       <dbl> <chr>  <chr>   <chr>         <chr>           <chr> <chr> <chr> <chr>
##  1 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  2 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  3 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  4 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  5 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  6 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  7 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  8 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
##  9 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
## 10 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
## 11 44759409 ME165  https:~ German Sheph~ Belgian Shephe~ Young Male  Medi~ Da-H~
## # ... with 4 more variables: contact_city.x <chr>, found <chr>,
## #   found_new <chr>, description.x <chr>

Da hui

3.2.4 Combine data: travel left_join state (state populations)

We combined state population data with the descriptions data in order to better understand the number of adoptable dogs with respect to the state population. The state totals were summarized and used to calculate the number of dogs per 100,000 people for each of the states.

state <- read.csv("https://raw.github.com/rtkoenig/datawrangling_data/master/state_pop.csv")
state <- tibble(state)
 
# prepare the data for the hex map
dog_descriptions %>%
  left_join(state, by = c("contact_state" = "abbrev") ) %>%
  group_by(contact_state, State) %>%
  summarise(num_dogs = n(), state_pop = max(Pop) )  %>%
  mutate(dogs_per_100k = num_dogs / (state_pop / 100000)) %>%
  select(state = State, contact_state, num_dogs, state_pop, dogs_per_100k) %>%
  filter(!is.na(dogs_per_100k)) %>%
  mutate(id = state, group = paste0(state, ".1")) %>%
  select(id, state, contact_state, num_dogs, state_pop, dogs_per_100k,
          #lat = latitude, long = longitude,
          group ) ->   dog_data

## `summarise()` has grouped output by 'contact_state'. You can override using the `.groups` argument.

3.3 Frequent Subsequence Mining Proposal

Looking into the word cloud we generated, the most frequent words that have been mentioned are not very useful for our adoption decision evaluation. We could further explore the presence of certain types of words in certain types of dogs to recommend similar records to a user interacting with our Shiny app.

We decide to try with Frequent itemset mining (FPA).

3.4 relationships among the three datasets:

From the example of Dahui, we can notice that we are empowered of summary and detailed information of adoptable dogs. The datasets is intended to provide as much trustworthy information to pet-finders as possible.

relationship

4 Conclusion

We explore three dataset: dog_move.csv, dog_travel.csv,dog_description.csv and found the relationships among the three dataset as well as more detailed information on the adopted dogs.

5 Appendix: Dictionary

5.1 dog_moves data

Variable	Class	description
`location`	character	The full name of the US state or country
`exported`	double	The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location
`imported`	double	The number of adoptable dogs available in this state that originated in a different location
`total`	double	The total number of adoptable dogs availabe in a given state.
`inUS`	logical	Whether or not a location is in the US or not. Here, US territories will return FALSE

5.2 dog_travel data

Variable	Class	description
`id`	double	The unique PetFinder identification number for each animal
`contact_city`	character	The rescue/shelter’s listed city
`contact_state`	character	The rescue/shelter’s listed State
`description`	character	The full description of each animal as entered by the rescue/shelter
`found`	character	Where the animal was found.
`manual`	character	.
`remove`	logical	Animal removed from location
`still_there`	logical TRUE/FALSE	- Whether the animal is still located in their origin location and will be transported to their final destination after adoption.

5.3 dog_descriptions data

Variable	Class	description
`id`	double	The unique PetFinder identification number for each animal.
`org_id`	character	The unique identification number for each shelter or rescue.
`url`	character	The URL for each animal’s listing.
`species`	character	Species of animal.
`breed_primary`	character	The primary (assumed) breed assigned by the shelter or rescue.
`breed_secondary`	character	The secondary (assumed) breed assigned by the shelter or rescue.
`breed_mixed`	logical	Whether or not an animal is presumed to be mixed breed.
`breed_unknown`	logical	Whether or not the animal’s breed is completely unknown.
`color_primary`	character	The most prevalent color of an animal.
`color_secondary`	character	The second most prevalent color of an animal.
`color_tertiary`	character	The third most prevalent color of an animal.
`age`	character	The assumed age class of an animal (Baby, Young, Adult, or Senior).
`sex`	character	The sex of an animal (Female, Male, or Unknown).
`size`	character	The general size class of an animal (Small, Medium, Large, Extra Large).
`coat`	character	Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).
`fixed`	logical	Whether or not an animal has been spayed/neutered.
`house_trained`	logical	Whether or not an animal is trained to not go to the bathroom in the house.
`declawed`	logical	Whether or not the animal has had its dewclaws removed.
`special_needs`	logical	Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).
`shots_current`	logical	Whether or not the animal is up to date on all of their vaccines and other shots.
`env_children`	logical	Whether or not the animal is recommended for a home with children.
`env_dogs`	logical	Whether or not the animal is recommended for a home with other dogs.
`env_cats`	logical	Whether or not the animal is recommended for a home with cats.
`name`	character	The animal’s name (as given by the shelter/rescue).
`tags`	character	Any tags given to the dog by the shelter rescue (pipe
`photo`	character	The URL to the animal’s primary photo.
`status`	character	Whether the animal is adoptable or not.
`posted`	character	The date that this animal was first listed on PetFinder .
`contact_city`	character	The rescue/shelter’s listed city.
`contact_state`	character	The rescue/shelter’s listed state.
`contact_zip`	character	The rescue/shelter’s listed zip code.
`contact_country`	character	The rescue/shelter’s listed country.
`stateQ`	character	The state abbreviation queried in the API to return this result .
`accessed`	double	The date that this data was acquired from the PetFinder API.
`type`	character	The type of animal.
`description`	character	The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.

Adoptable Dogs project

John Trygier, Yuqiu Wang, Carrie Vennefron, Dhruv Cairae, Bob Koenig

11/13/2021