Approximately 3.1 million dogs are sheltered annually in the United States. At these shelters, about 1.2 million dogs are put down annually in the United States. Even the dogs which are not put down are often scared and confused, which makes them less adoptable due to their behavior during visits. However, there may be a way to determine the adoptability of canines, which shelters could potentially use to optimize their adoption process. This would result in more shelter dogs being homed and not being re-sheltered. The first step is to understand what makes specific dogs “adoptable” or not.
The data used for investigating the adaptability of shelter dogs is that of The Pudding’s Amber Thomas and Sacha Maxim. The method that the team plans to use is by cleaning, exploring, and analyzing said data such that correlations and patterns can be illustrated which may explain the trends by which Americans adopt dogs.
Our current proposed approach/analytic technique which we believe will help to address this problem is Frequent Subsequence Mining or Subset Mining. By filtering for certain traits of dogs and observing if these traits possibly contribute to their eventual adoption. For example, do Americans prefer to adopt dogs which can be bred (like those who have not been spayed or neutered), or would a dog be more likely to be adopted if there was no chance for unwanted pregnancy? Does location affect this? Perhaps there are more dog breeders in southern states as opposed to northern states. Which breeds of dogs are more likely to be adopted, and are adopters willing to import or export their pets to get the specific breed that they want? By wrangling this data and exploring these factors, perhaps a clearer understanding of a dog’s adaptability can be gleaned.
This analysis will help shelters enhance their adoption process to target certain American demographics with different desires for the dogs they choose to adopt. This way, adoption can be facilitated because the shelters can either make their dogs more adoptable by declawing, grooming, training, sterilizing, importing, exporting, etc. Improved adoptability of dogs will then theoretically lead to increased rates of adoption, which will decrease overall suffering of sheltered dogs.
The package which was used in this analysis and is therefore required to replicate it is tidyverse. We chose to use this package because it is comprised of several other packages, including: ggplot2 (for creating graphics), dplyr (for data manipulation), tidyr (to help clean the data), readr (to read rectangular data), tibble (to modify data frames), etc. Our team used the tidyverse package because we wanted to utilize the following commands which are available in the tidyverse package: readr::read_csv, tibble(), mutate(), etc.
We also used the hwardcloud and wordcloud2 packages to generate word clouds. Additionally, package tm was used for text mining, package SnowballC was used for text stemming, and package RColorBrewer was used for access to different color palettes. We use the devtools github package, including the libraries d3scatter and crosstalk to generate interactive plots.
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'tm' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'tm'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\John Trygier\Documents\R\win-library\4.0\00LOCK\tm\libs\x64\tm.dll to C:
## \Users\John Trygier\Documents\R\win-library\4.0\tm\libs\x64\tm.dll: Permission
## denied
## Warning: restored 'tm'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'SnowballC' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'SnowballC'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\SnowballC\libs\x64\SnowballC.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\SnowballC\libs\x64\SnowballC.dll: Permission
## denied
## Warning: restored 'SnowballC'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'wordcloud' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'wordcloud'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\wordcloud\libs\x64\wordcloud.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\wordcloud\libs\x64\wordcloud.dll: Permission
## denied
## Warning: restored 'wordcloud'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'RColorBrewer' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning: package 'd3scatter' is not available (for R version 4.0.2)
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## devtools 2.4.2 2.4.3 FALSE
## installing the source package 'devtools'
## Skipping install of 'crosstalk' from a github remote, the SHA1 (8128ef3b) has not changed since last install.
## Use `force = TRUE` to force installation
## Skipping install of 'd3scatter' from a github remote, the SHA1 (aba6687f) has not changed since last install.
## Use `force = TRUE` to force installation
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\lubridate\libs\x64\lubridate.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\lubridate\libs\x64\lubridate.dll: Permission
## denied
## Warning: restored 'lubridate'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## glue 1.5.0 1.5.1 TRUE
##
## Binaries will be installed
## package 'glue' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'glue'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\John Trygier\Documents\R\win-library\4.0\00LOCK\glue\libs\x64\glue.dll
## to C:\Users\John Trygier\Documents\R\win-library\4.0\glue\libs\x64\glue.dll:
## Permission denied
## Warning: restored 'glue'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'gridExtra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'lattice' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lattice'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\John Trygier\Documents\R\win-
## library\4.0\00LOCK\lattice\libs\x64\lattice.dll to C:\Users\John
## Trygier\Documents\R\win-library\4.0\lattice\libs\x64\lattice.dll: Permission
## denied
## Warning: restored 'lattice'
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'leaflet' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'formattable' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'geojsonio' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning in install.packages("rgeos", type = "source", repos = "http://cran.us.r-
## project.org"): installation of package 'rgeos' had non-zero exit status
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning in install.packages("rgdal", type = "source", repos = "http://cran.us.r-
## project.org"): installation of package 'rgdal' had non-zero exit status
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'RColorBrewer' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'viridis' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'zipcodeR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
## Installing package into 'C:/Users/John Trygier/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'ggmap' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\John Trygier\AppData\Local\Temp\RtmpGS3LpP\downloaded_packages
# Load Libraries
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(tibble)
library(broom)
library(ggplot2)
library(gridExtra)
library(grid)
library(lattice)
library(formattable)
library(zipcodeR)
library(ggmap)
library(geojsonio)
library(RColorBrewer)
library(rgdal)
library(rgeos)
library(viridis)
library(usmap)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(leaflet)
library(crosstalk)
library(d3scatter)
library(lubridate)
library(glue)
Create 2 variables named state_fulnameand state_abrv representing full name and abbreviation of 51 states to avoid duplication.
state_fulname = c('Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia','Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts','Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey','New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island','South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia','Wisconsin', 'Wyoming', 'Washington DC')
state_abrv= c('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME','MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA','RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC')
Create a Function named freq_count to avoid iteration. The function has argument of df(the data frame), group_var(the variable we want to group_by) and n(top n biggest frequency count numbers).
freq_count <- function(df, group_var, n) {
require(dplyr)
group_var <- enquo(group_var) # need to quote
#summary_var <- enquo(summary_var)
df %>%
group_by(!!group_var) %>% # !! unquotes
summarise(freq_sum = n())%>%
arrange(desc(freq_sum))%>%
top_n(n=n)
}
Data for this project comes from Adoptable Dogs on the R for Data Science Github repository. here
This data was originally used in a project by Amber Thomas and designed by Sacha Maxim. This data was included with their article, Finding Forever Homes and uses this adoptable dogs data from 09-20-2019 from Petfinder.com.
Included are three data sets. Two of which can be joined and at third that is a summary data set.
Note: The original datadog_descriptions.csv have data entry mistakes. Thus, we corrected on the data entries and load the new data from our personal github repository.
Load Data and library
dog_moves <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv')
dog_travel <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_travel.csv')
dogtraveltry<- readr::read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dog-shelters/dogTravel.csv")
#original data:
#dog_descriptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_descriptions.csv')
#new dog_descriptions
dog_descriptions<-readr::read_csv('https://raw.githubusercontent.com/njuteresa2019/MS-BA-data-wrangling/main/dog_descriptions.csv')
Remove Missing Values: Move dataset
summary(dog_moves)
## location exported imported total
## Length:90 Min. : 1.00 Min. : 1.00 Min. : 15.0
## Class :character 1st Qu.: 1.00 1st Qu.: 11.25 1st Qu.: 477.5
## Mode :character Median : 4.00 Median : 24.50 Median : 857.0
## Mean : 30.37 Mean : 64.74 Mean :1139.8
## 3rd Qu.: 18.00 3rd Qu.: 73.25 3rd Qu.:1650.0
## Max. :635.00 Max. :390.00 Max. :4002.0
## NA's :9 NA's :52 NA's :39
## inUS
## Mode :logical
## FALSE:39
## TRUE :51
##
##
##
##
x <- dog_moves[is.na(dog_moves$total),]
print(tibble(x), n=10)
## # A tibble: 39 x 5
## location exported imported total inUS
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Puerto Rico 131 NA NA FALSE
## 2 South Korea 76 NA NA FALSE
## 3 Mexico 54 NA NA FALSE
## 4 China 28 NA NA FALSE
## 5 Thailand 20 NA NA FALSE
## 6 India 18 NA NA FALSE
## 7 Taiwan 18 NA NA FALSE
## 8 Egypt 15 NA NA FALSE
## 9 Bahamas 13 NA NA FALSE
## 10 Spain 8 NA NA FALSE
## # ... with 29 more rows
Treatment of missing values: Travel and Description dataset
As per description The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location and vice versa for imported dogs. Total is simply total dogs available. For imported and exported we can replace missing values with 0, although as we don’t want to lose entire rows and possibly it might actually be 0 as well. Apart from Indianapolis rest of the values are foreign countries for total so it does not matter much as we are not focused on total dogs available for adoption outside the US. It will simply be transformed into 0. Logically and for our analysis, this would not have an impact.
dog_moves$total[is.na(dog_moves$total)] <- 0
dog_moves$imported[is.na(dog_moves$imported)] <- 0
dog_moves$exported[is.na(dog_moves$exported)] <- 0
We have missing values for remove and still_there. We might not have this data collected but it does have to exist Animal removed from location is for remove. still_there is Whether the animal is still located in their origin location and will be transported to their final destination after adoption. Ideally interaction with source of data would clarify this. For remove we can assume NA is false which will imply animal is still present at location. Similarly for still there.These movements are for dogs in transit for adoption and anything else we can assume the large chuck of dogs are not moving and hence NA turned into FALSE.
dog_travel$remove[is.na(dog_travel$remove)] <- FALSE
dog_travel$still_there[is.na(dog_travel$still_there)] <- FALSE
summary(dog_travel)
## id contact_city contact_state description
## Min. : 8619716 Length:6194 Length:6194 Length:6194
## 1st Qu.:44940096 Class :character Class :character Class :character
## Median :45734990 Mode :character Mode :character Mode :character
## Mean :44786280
## 3rd Qu.:45923182
## Max. :46043149
## found manual remove still_there
## Length:6194 Length:6194 Mode :logical Mode :logical
## Class :character Class :character FALSE:4456 FALSE:5875
## Mode :character Mode :character TRUE :1738 TRUE :319
##
##
##
Declawed, color_tertiary, photo, and tags are removed as nearly all values here are missing. Species is dog for every row so it adds no additional value.
dog_descriptions <- subset(dog_descriptions,select = -c(declawed,species,color_tertiary ,photo ,tags))
Checking for counts of NA’s by attribute
Remaining missing values are all in character variables, the NA values can be replaced by UNKNOWN character string for models which don’t tolerate missing values.
colSums(is.na(dog_descriptions))
## id org_id url breed_primary breed_secondary
## 0 0 0 0 37359
## breed_mixed breed_unknown color_primary color_secondary age
## 0 0 32046 46121 0
## sex size coat fixed house_trained
## 0 0 30995 0 0
## special_needs shots_current env_children env_dogs env_cats
## 0 0 30153 23511 38828
## name status posted contact_city contact_state
## 0 0 0 0 0
## contact_zip contact_country stateQ accessed type
## 12 0 0 0 640
## description
## 8705
colSums(is.na(dog_travel))
## id contact_city contact_state description found
## 0 0 0 0 0
## manual remove still_there
## 4047 0 0
colSums(is.na(dog_moves))
## location exported imported total inUS
## 0 0 0 0 0
Evaluation of the data types and values
In this section we evaluate the individual variables. Looking at the datatype and values in the attribute.
Is the datatype appropriate?
Are there data that will disrupt the analysis, including outliers?
Run summary on the attribute.
Evaluate that data. Running additional reports and plots to inspect the data including but not limited to tidy queries, visually inspecting output and plots.
Run any cleanup or tidying that was appropriate.
Provide visual representation, table or other output to communicate the contents.
Included below all the attributes and the summary views that were chosen to represent the data. Where cleanup or other manipulation was needed, this code is included and explained in more detail.
This dataset was derived from Dog Descriptions.csv and Dog Travel.csv file to find the total numbers of imports and exports for each location. This data represents a single day of data. It was all collected on September 20, 2019. The script to process the file can be found here.
dim(dog_moves)
## [1] 90 5
head(dog_moves)
## # A tibble: 6 x 5
## location exported imported total inUS
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Texas 635 0 566 TRUE
## 2 Alabama 268 2 1428 TRUE
## 3 North Carolina 158 14 2627 TRUE
## 4 South Carolina 139 12 1618 TRUE
## 5 Georgia 137 19 3479 TRUE
## 6 Puerto Rico 131 0 0 FALSE
summary(dog_moves)
## location exported imported total
## Length:90 Min. : 0.00 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 0.0
## Mode :character Median : 3.00 Median : 0.00 Median : 66.5
## Mean : 27.33 Mean : 27.33 Mean : 645.9
## 3rd Qu.: 16.50 3rd Qu.: 15.50 3rd Qu.: 939.5
## Max. :635.00 Max. :390.00 Max. :4002.0
## inUS
## Mode :logical
## FALSE:39
## TRUE :51
##
##
##
location - character - The full name of the US state or country.
freq_count(dog_moves,location,5)
## # A tibble: 90 x 2
## location freq_sum
## <chr> <int>
## 1 Afghanistan 1
## 2 Alabama 1
## 3 Alaska 1
## 4 Arizona 1
## 5 Arkansas 1
## 6 Aruba 1
## 7 Azerbaijan 1
## 8 Bahamas 1
## 9 Bahrain 1
## 10 Bosnia 1
## # ... with 80 more rows
There are 90 rows in this dataset. Each row represents a specific US state or country.
exported - double - The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location
imported - double - The number of adoptable dogs available in this state that originated in a different location
total - double - The total number of adoptable dogs available in a given state.
inUS - logical - Whether or not a location is in the US or not. Here, US territories will return FALSE
freq_count(dog_moves,inUS,5)
## # A tibble: 2 x 2
## inUS freq_sum
## <lgl> <int>
## 1 TRUE 51
## 2 FALSE 39
There are 51 observations that are in the U.S. and 39 from other countries. We can create a new variable country and fill with “U.S.” and other country names.
dog_moves<- dog_moves%>%
mutate(country= ifelse(inUS=="TRUE","US",location))
This dataset was derived from Dog Descriptions.csv. This file only includes information on dogs whose description indicates that they did not originate in the state where they were made available for adoption. This file aims to show where those dogs are available and where they came from. This data represents a single day of data. It was all collected on September 20, 2019. The script to process the file can be found here.
dim(dog_travel)
## [1] 6194 8
head(dog_travel)
## # A tibble: 6 x 8
## id contact_city contact_state description found manual remove still_there
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl>
## 1 44520267 Anoka MN "Boris is ~ Arka~ <NA> FALSE FALSE
## 2 44698509 Groveland FL "Duke is a~ Abac~ Baham~ FALSE FALSE
## 3 45983838 Adamstown MD "Zac Woof-~ Adam Maryl~ FALSE FALSE
## 4 44475904 Saint Cloud MN "~~Came in~ Adap~ <NA> TRUE FALSE
## 5 43877389 Pueblo CO "Palang is~ Afgh~ <NA> FALSE FALSE
## 6 43082511 Manchester CT "Brooke ha~ Afgh~ <NA> FALSE FALSE
summary(dog_travel)
## id contact_city contact_state description
## Min. : 8619716 Length:6194 Length:6194 Length:6194
## 1st Qu.:44940096 Class :character Class :character Class :character
## Median :45734990 Mode :character Mode :character Mode :character
## Mean :44786280
## 3rd Qu.:45923182
## Max. :46043149
## found manual remove still_there
## Length:6194 Length:6194 Mode :logical Mode :logical
## Class :character Class :character FALSE:4456 FALSE:5875
## Mode :character Mode :character TRUE :1738 TRUE :319
##
##
##
id - double - The unique PetFinder identification number for each animal
contact_city - character - The rescue/shelter’s listed city. Cleaning: change zipcode=17325 to it’s city name Gettysburg.
#table(dog_travel$contact_city)
dog_travel%>%
filter(nchar(contact_city)<3)
## # A tibble: 15 x 8
## id contact_city contact_state description found manual remove still_there
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl>
## 1 45362806 OH OH "Liberty c~ Cinc~ Ohio FALSE FALSE
## 2 36978896 PA PA "Maddie is~ Mary~ <NA> TRUE FALSE
## 3 33218331 PA PA "Born in A~ Mary~ <NA> TRUE FALSE
## 4 36978896 PA PA "Maddie is~ New ~ <NA> TRUE FALSE
## 5 33218331 PA PA "Born in A~ New ~ <NA> TRUE FALSE
## 6 36978896 PA 17325 "Maddie is~ New ~ <NA> TRUE FALSE
## 7 33218331 PA 17325 "Born in A~ New ~ <NA> TRUE FALSE
## 8 36978896 PA 17325 "Maddie is~ Penn~ <NA> TRUE FALSE
## 9 33218331 PA 17325 "Born in A~ Penn~ <NA> TRUE FALSE
## 10 36978896 PA 17325 "Maddie is~ Virg~ <NA> TRUE FALSE
## 11 33218331 PA 17325 "Born in A~ Virg~ <NA> TRUE FALSE
## 12 36978896 PA 17325 "Maddie is~ Wash~ <NA> TRUE FALSE
## 13 33218331 PA 17325 "Born in A~ Wash~ <NA> TRUE FALSE
## 14 36978896 PA 17325 "Maddie is~ West~ <NA> TRUE FALSE
## 15 33218331 PA 17325 "Born in A~ West~ <NA> TRUE FALSE
#cleaning
dog_travel<- dog_travel%>%
mutate(contact_city=ifelse(contact_state=="17325","Gettysburg",contact_city))
contact_state - character - The rescue/shelter’s listed State. Cleaning: change the zipcode to the state code name.
table(dog_travel$contact_state)
##
## 17325 AL AR AZ CA CO CT DC DE FL GA IA IL
## 10 75 10 70 28 103 90 112 57 133 109 22 37
## IN KS KY LA MA MD ME MI MN MO MS NB NC
## 65 3 31 5 107 379 361 67 190 42 6 2 187
## NH NJ NM NV NY OH OK OR PA RI SC TN TX
## 44 552 82 20 490 177 31 32 316 130 21 112 2
## UT VA VT WA WI WV
## 66 1025 49 634 83 27
#cleaning
dog_travel$contact_state[dog_travel$contact_state=="17325"]<-'PA'
description - character - The full description of each animal as entered by the rescue/shelter
head(dog_travel$description, 1)
## [1] "Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He loves rope toys and running around his foster mom's yard. \n\nHe is 4 years old, just under 10 pounds, and is a special needs dog. He needs vitamin B-12 shots every other week (may change), and needs prescription dog food with no meet or protein. Boris is on a vegan diet.\n\nIf interested in Boris please fill out or adoption application so we can set up a meet and greet for you! His foster mom can explain more in-depth about his needs and can answer any questions you may have about his lifestyle.\n\nBoris' adoption fee is $300 and the application can be found at http://www.aohrescue.org/dog-adoption-application"
found - character - Where the animal was found. Note: this is a mixed bag of values States/countries/cities
table(dog_travel$found)
The entries looks abnormal:
dog_travel%>%
filter(found%in%c("Beijing","Ark","Ark.","Buddy","Chihuahua"))
## # A tibble: 11 x 8
## id contact_city contact_state description found manual remove still_there
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl>
## 1 44182772 Crescent Ci~ FL "Go to www~ Ark <NA> TRUE FALSE
## 2 45773500 Raleigh NC "Are you l~ Ark. <NA> TRUE FALSE
## 3 45463453 Sebec ME "Please co~ Beij~ China FALSE FALSE
## 4 44759410 Sebec ME "Please co~ Beij~ China FALSE FALSE
## 5 44759409 Sebec ME "Please co~ Beij~ China FALSE FALSE
## 6 45319754 Southampton NY "You can f~ Buddy <NA> TRUE FALSE
## 7 44828661 Manchester NH "*This dog~ Buddy <NA> TRUE FALSE
## 8 45672267 Litchfield ~ AZ "ANGEL (Pu~ Chih~ Arizo~ FALSE FALSE
## 9 46028329 Egg Harbor ~ NJ "BENNY & L~ Chih~ <NA> TRUE FALSE
## 10 45985438 Egg Harbor ~ NJ "BENNY & L~ Chih~ <NA> TRUE FALSE
## 11 45404474 Chester Spr~ PA "Warren th~ Chih~ <NA> TRUE FALSE
horrible<- dog_travel%>%filter(found=="HORRIBLE")
After looking into the data, we do not need to clean them. They are names of places instead of wrong data entries. Buddy City is a Day Care, Boarding, Grooming and Training facility for dogs located in the city center of Shanghai. Chihuahua is the largest state in Mexico.
manual - character - The data was manually reviewed and cleaned to remove any original locations that contained only a vague location (e.g., the south, the Carolinas) in favor of those explicitly described.
dog_travel<- dog_travel%>%
mutate(found_new=ifelse(is.na(manual),found,manual))
Therefore, we create a new variablefound_new to combine found and manual.
remove - logical - Animal removed from location
still_there - logical - TRUE/FALSE - Whether the animal is still located in their origin location and will be transported to their final destination after adoption.
freq_count(dog_travel,still_there,5)
## # A tibble: 2 x 2
## still_there freq_sum
## <lgl> <int>
## 1 FALSE 5875
## 2 TRUE 319
Only 319 dogs are still there (TRUE).
There are 58,180 rows in this dataset. Each row represents an individual adoptable dog in the US on September 20, 2019. Each dog has a unique ID number. Unless otherwise noted, all of the data is exactly is reported by the shelter or rescue that posted an individual animal for adoption on PetFinder.
Dataset dimensions
dim(dog_descriptions)
## [1] 58180 31
head(dog_descriptions)
## # A tibble: 6 x 31
## id org_id url breed_primary breed_secondary breed_mixed breed_unknown
## <dbl> <chr> <chr> <chr> <chr> <lgl> <lgl>
## 1 46042150 NV163 https~ American Sta~ Mixed Breed TRUE FALSE
## 2 46042002 NV163 https~ Pit Bull Ter~ Mixed Breed TRUE FALSE
## 3 46040898 NV99 https~ Shepherd <NA> FALSE FALSE
## 4 46039877 NV202 https~ German Sheph~ <NA> FALSE FALSE
## 5 46039306 NV184 https~ Dachshund <NA> FALSE FALSE
## 6 46039304 NV184 https~ Boxer Beagle TRUE FALSE
## # ... with 24 more variables: color_primary <chr>, color_secondary <chr>,
## # age <chr>, sex <chr>, size <chr>, coat <chr>, fixed <lgl>,
## # house_trained <lgl>, special_needs <lgl>, shots_current <lgl>,
## # env_children <lgl>, env_dogs <lgl>, env_cats <lgl>, name <chr>,
## # status <chr>, posted <dttm>, contact_city <chr>, contact_state <chr>,
## # contact_zip <chr>, contact_country <chr>, stateQ <chr>, accessed <date>,
## # type <chr>, description <chr>
summary(dog_descriptions)
## id org_id url breed_primary
## Min. : 604115 Length:58180 Length:58180 Length:58180
## 1st Qu.:44516882 Class :character Class :character Class :character
## Median :45596624 Mode :character Mode :character Mode :character
## Mean :44251485
## 3rd Qu.:45916674
## Max. :46043149
## breed_secondary breed_mixed breed_unknown color_primary
## Length:58180 Mode :logical Mode :logical Length:58180
## Class :character FALSE:16589 FALSE:58180 Class :character
## Mode :character TRUE :41591 Mode :character
##
##
##
## color_secondary age sex size
## Length:58180 Length:58180 Length:58180 Length:58180
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## coat fixed house_trained special_needs
## Length:58180 Mode :logical Mode :logical Mode :logical
## Class :character FALSE:11559 FALSE:37624 FALSE:56034
## Mode :character TRUE :46621 TRUE :20556 TRUE :2146
##
##
##
## shots_current env_children env_dogs env_cats
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:16112 FALSE:4439 FALSE:3547 FALSE:6810
## TRUE :42068 TRUE :23588 TRUE :31122 TRUE :12542
## NA's :30153 NA's :23511 NA's :38828
##
##
## name status posted
## Length:58180 Length:58180 Min. :2003-05-18 00:00:00
## Class :character Class :character 1st Qu.:2019-04-22 07:02:14
## Mode :character Mode :character Median :2019-08-13 15:40:24
## Mean :2019-03-09 14:18:36
## 3rd Qu.:2019-09-09 17:25:28
## Max. :2019-09-20 17:32:30
## contact_city contact_state contact_zip contact_country
## Length:58180 Length:58180 Length:58180 Length:58180
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## stateQ accessed type description
## Length:58180 Min. :2019-09-20 Length:58180 Length:58180
## Class :character 1st Qu.:2019-09-20 Class :character Class :character
## Mode :character Median :2019-09-20 Mode :character Mode :character
## Mean :2019-09-20
## 3rd Qu.:2019-09-20
## Max. :2019-09-20
id - double - The unique PetFinder identification number for each animal.
quantile(dog_descriptions$id)
## 0% 25% 50% 75% 100%
## 604115 44516882 45596624 45916674 46043149
org_id - character - The unique identification number for each shelter or rescue.
freq_count(dog_descriptions,org_id, 5)
## Selecting by freq_sum
## # A tibble: 6 x 2
## org_id freq_sum
## <chr> <int>
## 1 GA423 473
## 2 GA217 418
## 3 NY1274 311
## 4 AZ414 263
## 5 AZ101 241
## 6 NJ376 241
These are top 5 shelter/ rescue organizations and GA423 ranked the first with 473 rescue records.
url - character - The URL for each animal’s listing.
head(dog_descriptions$url, 5)
## [1] "https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [2] "https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [3] "https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [4] "https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
## [5] "https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id=87b31e7d-4508-41d1-95ff-fdb59b9d4669"
species - character - Species of animal. REMOVED in prior step (all rows are Dogs)
breed_primary - character - The primary (assumed) breed assigned by the shelter or rescue.
head(table(dog_descriptions$breed_primary))
##
## Affenpinscher Afghan Hound Airedale Terrier Akbash
## 17 4 19 3
## Akita Alaskan Malamute
## 181 72
x<- freq_count(dog_descriptions,breed_primary,200)
## Selecting by freq_sum
Most frequent breed within the data is Pit Bull Terrier.
breed_secondary - character - The secondary (assumed) breed assigned by the shelter or rescue.
head(table(dog_descriptions$breed_secondary))
##
## Affenpinscher Afghan Hound Airedale Terrier Akbash
## 18 1 6 10
## Akita Alaskan Malamute
## 28 13
breed_mixed - logical - Whether or not an animal is presumed to be mixed breed.
summary(dog_descriptions$breed_mixed)
## Mode FALSE TRUE
## logical 16589 41591
breed_unknown - logical - Whether or not the animal’s breed is completely unknown. REMOVED - These were all FALSE so attribute is removed
table(dog_descriptions$breed_unknown)
##
## FALSE
## 58180
dog_descriptions <-
dog_descriptions %>%
mutate(breed_unknown = NULL)
color_primary - character - The most prevalent color of an animal.
table(dog_descriptions$color_primary)
##
## Apricot / Beige Bicolor
## 1322 1415
## Black Brindle
## 7639 1953
## Brown / Chocolate Golden
## 2434 817
## Gray / Blue / Silver Harlequin
## 1174 73
## Merle (Blue) Merle (Red)
## 186 142
## Red / Chestnut / Orange Sable
## 1652 159
## Tricolor (Brown, Black, & White) White / Cream
## 2094 3134
## Yellow / Tan / Blond / Fawn
## 1940
freq_count(dog_descriptions,color_primary,5)
## Selecting by freq_sum
## # A tibble: 5 x 2
## color_primary freq_sum
## <chr> <int>
## 1 <NA> 32046
## 2 Black 7639
## 3 White / Cream 3134
## 4 Brown / Chocolate 2434
## 5 Tricolor (Brown, Black, & White) 2094
Nearly half of the color information are missing.
color_secondary - character - The second most prevalent color of an animal.
table(dog_descriptions$color_secondary)
##
## Apricot / Beige Bicolor
## 45 55
## Black Brindle
## 497 269
## Brown / Chocolate Golden
## 801 184
## Gray / Blue / Silver Harlequin
## 296 37
## Merle (Blue) Merle (Red)
## 33 29
## Red / Chestnut / Orange Sable
## 339 34
## Tricolor (Brown, Black, & White) White / Cream
## 124 7285
## Yellow / Tan / Blond / Fawn
## 2031
color_tertiary - character - The third most prevalent color of an animal. REMOVED in a prior step
age - character - The assumed age class of an animal (Baby, Young, Adult, or Senior).
table(dog_descriptions$age)
##
## Adult Baby Senior Young
## 27955 9397 4634 16194
sex - character - The sex of an animal (Female, Male, or Unknown).
table(dog_descriptions$sex)
##
## Female Male Unknown
## 27883 30294 3
size - character - The general size class of an animal (Small, Medium, Large, Extra Large).
table(dog_descriptions$size)
##
## Extra Large Large Medium Small
## 931 15761 29908 11580
coat - character - Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire).
table(dog_descriptions$coat)
##
## Curly Hairless Long Medium Short Wire
## 135 16 1185 4927 20671 251
fixed - logical - Whether or not an animal has been spayed/neutered.
table(dog_descriptions$fixed)
##
## FALSE TRUE
## 11559 46621
house_trained - logical - Whether or not an animal is trained to not go to the bathroom in the house.
table(dog_descriptions$house_trained)
##
## FALSE TRUE
## 37624 20556
declawed - logical - Whether or not the animal has had its dewclaws removed. REMOVED in a prior step all were FALSE
special_needs - logical - Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).
table(dog_descriptions$special_needs)
##
## FALSE TRUE
## 56034 2146
shots_current - logical - Whether or not the animal is up to date on all of their vaccines and other shots.
table(dog_descriptions$shots_current)
##
## FALSE TRUE
## 16112 42068
env_children - logical - Whether or not the animal is recommended for a home with children.
table(dog_descriptions$env_children)
##
## FALSE TRUE
## 4439 23588
env_dogs - logical - Whether or not the animal is recommended for a home with other dogs.
table(dog_descriptions$env_dogs)
##
## FALSE TRUE
## 3547 31122
env_cats - logical - Whether or not the animal is recommended for a home with cats.
table(dog_descriptions$env_cats)
##
## FALSE TRUE
## 6810 12542
name - character - The animal’s name (as given by the shelter/rescue). This data was cleaned up a bit. There were Identification number and other extra information that for the analysis that we are looking to perform will be distracting and confusing to the end users. It might be necessary to do additional cleanup as we proceed. We are starting by getting the numbers that would be meaningless to the end user cleaned up.
The process is using regular expressions to update the values with just an id number to “Unknown”. Note: includes some notes Like “sponsored” “courtesy post” Id numbers of some kind
head(dog_descriptions$name, 20)
## [1] "HARLEY" "BIGGIE" "Ziggy" "Gypsy" "Theo"
## [6] "Oliver" "Macadamia" "Dodger" "Huckleberry" "Fagin"
## [11] "Speckles" "Cashew" "Dash" "Sydney" "HENNA"
## [16] "RUBO" "LEGO" "MARIE" "RINGO" "TONY"
dim(dog_descriptions)
## [1] 58180 30
dog_descriptions <- dog_descriptions %>%
mutate(name = str_to_title(name))
dog_descriptions <- dog_descriptions %>%
mutate(name = ifelse( str_detect(name, "^[0-9]*$"), "Unknown", name) ) %>%
mutate(name = ifelse( !str_detect(name, "[A-Z][A-z]*"), "Unknown", name) ) %>%
mutate(name = ifelse( str_detect(name, "^A[0-9]"), "Unknown", name) )
tags - character - Any tags given to the dog by the shelter rescue (pipe | separated). REMOVED in a prior step
photo - character - The URL to the animal’s primary photo. REMOVED in a prior step
status - character - Whether the animal is adoptable or not.
summary(dog_descriptions$status)
## Length Class Mode
## 58180 character character
posted - character - The date that this animal was first listed on PetFinder .
This single data-time attribute has been separated into separate attributes for each of the periods, posted_year, posted_month, …
dog_descriptions <-
dog_descriptions %>%
separate(posted, c('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec'), sep = "([- :])")
dog_descriptions %>%
select('posted_year', 'posted_month', 'posted_day', 'posted_hour', 'posted_min', 'posted_sec') %>%
head()
## # A tibble: 6 x 6
## posted_year posted_month posted_day posted_hour posted_min posted_sec
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2019 09 20 16 37 59
## 2 2019 09 20 16 24 57
## 3 2019 09 20 14 10 11
## 4 2019 09 20 10 08 22
## 5 2019 09 20 06 48 30
## 6 2019 09 20 06 43 59
contact_city - character - The rescue/shelter’s listed city.
Clean up: This field had different issues with cases. This effecting the way that the cities were grouped together.
The process used was a mapping tribble and then joining this to the original data and updating. Another way would have been to change the attribute to Title case throughout. This would have taken less code but would not ensure that all the modifications would have been approprtate as the menthod allows. For example Washington DC would be converted to Washington Dc if title case would have been used.
map <- tribble(
~contact_city, ~adj_city,
"abingdon", "Abingdon",
"lake charles", "Lake Charles",
"JERSEY CITY", "Jersey City",
"aitkin", "Aitkin",
"cary", "Cary",
"carson", "Carson",
"BROOKLYN", "Brooklyn",
"denver", "Denver",
"grandville", "Grandville",
"GREER","Greer",
"BERLIN", "Berlin",
"YERINGTON", "Yerington",
"SACRAMENTO","Sacramento",
"LINDSAY","Lindsay",
"MONUMENT","Monument",
"WHEAT RIDGE","Wheat Ridge",
"NASSAU","Nassau",
"NEWARK","Newark",
"MARION","Marion",
"TROUTMAN","Troutman",
"JIM THORPE","Jim Thorpe",
"POCONO SUMMIT","Pocono Summit",
"MARYSVILLE","Marysville",
"reno", "Reno",
"mount vernon","Mount Vernon",
"modesto","Modesto",
"ridgefield","Ridgefield",
"taunton","Taunton",
"garden city park","Garden City Park",
"churchton","Churchton",
"huntingtown","Huntingtown",
"fayetteville","Fayetteville",
"canton","Canton",
"seattle","Seattle"
)
dog_descriptions <-
dog_descriptions %>%
left_join(map) %>%
mutate(contact_city = if_else(is.na(adj_city), contact_city, adj_city))
## Joining, by = "contact_city"
dog_descriptions %>% distinct(contact_city) %>% head()
## # A tibble: 6 x 1
## contact_city
## <chr>
## 1 Las Vegas
## 2 Mesquite
## 3 Pahrump
## 4 Henderson
## 5 Bullhead City
## 6 Kingman
Taking a close look of the original data, it’s data entry problem. Value or NA was not entered for wrong columns, thus columnsstatus,contact_city,contact_state,contact_contry,contact_zip, stateQ,accessed were shifted (mismatching column names). we cleaned them in the original data. For example, In the original data id=“31426754”: There is one record have data entry that had put state in city cell (the data was shifted). Therefore, we used and loaded new data instead.
table(dog_descriptions$contact_city)
#check if the data is corrected
#a<- dog_descriptions%>%
# filter(id=="31426754")
contact_state - character - The rescue/shelter’s listed state. Noted that in the original data there are numeric values in the contact_state attribute. There’s 32 rows of entries that is zipcode instead of state abbreviation code. Check new data if cleaned:
##
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN
## 15 1428 695 2249 1664 1773 1422 336 296 2659 3479 69 485 49 1115 1880
## KS KY LA MA MD ME MI MN MO MS MT NB NC ND NE NH
## 470 1123 913 946 1494 545 673 958 920 510 18 2 2627 64 120 335
## NJ NM NV NY OH OK OR PA QC RI SC SD TN TX UT VA
## 3022 637 858 4007 2673 1636 91 2825 14 607 1618 24 1771 566 485 3061
## VT WA WI WV WY
## 510 1284 542 565 52
contact_zip - character - The rescue/shelter’s listed zip code.
summary(dog_descriptions$contact_zip)
## Length Class Mode
## 58180 character character
contact_country - character - The rescue/shelter’s listed country. Noted that there are a few countries that would seem to have state codes entered. These seem to be related to the similar issue seen in the state attribute above.
table(dog_descriptions$contact_country)
##
## CA US
## 16 58164
stateQ - character - The state abbreviation queried in the API to return this result .
table(dog_descriptions$stateQ)
##
## 89009 89014 89024 89027 89121 89406 89408 89423 89431 89451 89801 AK AL
## 747 1 284 2 25 181 8 435 37 134 5 15 1043
## AR AZ CA CO CT DC DE FL GA HI IA ID IL
## 695 2170 1028 1773 6730 4669 2367 2659 3439 69 485 49 1119
## IN KS KY LA MA MD ME MI MN MO MS MT NC
## 1920 470 1673 913 909 36 134 673 957 920 510 18 3085
## ND NE NH NJ NM NY OH OK OR PA RI SC SD
## 64 120 382 1863 637 669 1901 1636 91 337 1 1137 24
## TN TX UT VA VT WA WI WV WY
## 2325 566 199 1167 118 1284 543 677 52
accessed - double - The date that this data was acquired from the PetFinder API.
This date was separated into is date components.
dog_descriptions <-
dog_descriptions %>% separate(accessed, c('accessed_year', 'accessed_month', 'accessed_day' ), sep = '-')
dog_descriptions %>%
select(accessed_year, accessed_month, accessed_day) %>%
head()
## # A tibble: 6 x 3
## accessed_year accessed_month accessed_day
## <chr> <chr> <chr>
## 1 2019 09 20
## 2 2019 09 20
## 3 2019 09 20
## 4 2019 09 20
## 5 2019 09 20
## 6 2019 09 20
type - character - The type of animal.
table(dog_descriptions$type)
##
## Dog
## 57540
description - character - The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.
head(dog_descriptions$description,1)
## [1] "Harley is not sure how he wound up at shelter in his senior years but as you see from the pictures the shelter asked if we could find a real home for this active senior boy. You would never know he is 9 years old. Very playful and loves humans and all the dogs he has met at adoptions he seems to like. He is 59 lbs so still pretty strong for a senior but loves walks and you have to love his ears. If you would like to meet this happy go lucky boy please contact AdoptAnimalNetwork@gmail.com. Updated pictures this Sunday."
We intend to use information of adoption fee and transport fee and age in the $description column,only a small percentage of age and fees information can be extracted from the description columns.
dog_descriptions %>%
select(description) %>%
mutate(age_extract = str_detect(description, "years old"),
fee_extract=str_detect(description,"fee$"))%>%
summarize(pct_age = mean(age_extract, na.rm = TRUE),
pct_fee = mean(fee_extract, na.rm = TRUE))
## # A tibble: 1 x 2
## pct_age pct_fee
## <dbl> <dbl>
## 1 0.154 0.000141
A new data set travel_detail is created containing information about specific dogs travels where it is available. This new table is obtained by linking the dog_descriptions data set with the dog_travel data set using the provided id variable. The new data set provides a easy way to retrieve additional information about individual dogs travels where this information is provided.
# join the dog_descriptions and the dog_travel using the id that was provided.
travel_detail <- dog_descriptions %>%
left_join(dog_travel, by="id")
To provide the ability to easily drill down to the detail about a given state based on the summary data the dog_moves is combined with the dog_description based on the state.We create a state_codes data for join purpose.
## join the summary table dog_moves with the detail (descriptions)
# Create a look up table to translate from the state name to the state code.
state_codes <- tibble(state = state_fulname,
abbrev = state_abrv)
# join the dog_moves to the dog_descriptions by state/state code through state_codes, the look up table
summary_detail <- dog_moves %>%
inner_join(state_codes, by = c("location" = "state")) %>%
inner_join(dog_descriptions, by = c("abbrev" = 'contact_state'))
Check to make sure that the summary total in the dog_moves data set matches the total number of rows from the details data set (dog_description). This “dim()” call should return 0 x number of columns.
# make sure that the totals still add up
summary_detail %>%
group_by(location) %>%
mutate(tot = max(total), count = n()) %>%
filter(tot != count) %>%
dim()
## [1] 24767 46
We will start EDA by exploring and visualizing the variables within each data sets.
We build interactive plots for the country a dog is in to provide information to the user.
Dog Location
shared_dogmoves <- SharedData$new(dog_moves)
bscols(widths = c(3,NA,NA),
list(
filter_checkbox("ctr", "In the US?", shared_dogmoves, ~inUS, inline = TRUE),
filter_select("st", "Location Name", shared_dogmoves, ~location)
),
d3scatter(shared_dogmoves, ~total, ~exported, ~factor(inUS), width="100%", height=250)
)
Above we can see a plot comparing the total number of dogs plotted against the total number of exported dogs in our dataset with an interactive filter to view only dogs in the US vs. those outside of the US, as well as a dropdown location selector. We could incorporate a more advanced version of this inside of our shiny app to give insight as to the total number of dogs in a given location, as well as whether they are generally exported or imported.
barplot
par(mfrow=c(2,2))
#inUS
barplot(table(dog_moves$inUS), ylab = "Count of Dogs", main = "$inUS:Dogs in vs. not in the US")
#Exported dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Exported Dogs", main = "# of Exported Dogs by Location", las = 2)
#imported dogs
barplot(dog_moves$imported ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "# of Imported Dogs by Location", las = 2)
#total dogs
barplot(dog_moves$total ~ dog_moves$location, xlab = "Location", ylab = "# of Dogs", main = "Total # of Dogs by Location", las = 2)
top 5 exported countries:
sum<- dog_moves%>%
group_by(country)%>%
summarise(sum_export= sum(exported),
sum_import= sum(imported))%>%
filter(country!="US")%>%
arrange(desc(sum_export))
head(sum)
## # A tibble: 6 x 3
## country sum_export sum_import
## <chr> <dbl> <dbl>
## 1 Puerto Rico 131 0
## 2 South Korea 76 0
## 3 Mexico 54 0
## 4 China 28 0
## 5 Thailand 20 0
## 6 India 18 0
contact state
We intend to gain geographical visualization of travel summarized by contact state. We could also do similar with found.
freq_count(dog_travel,contact_state,5)
## # A tibble: 5 x 2
## contact_state freq_sum
## <chr> <int>
## 1 VA 1025
## 2 WA 634
## 3 NJ 552
## 4 NY 490
## 5 MD 379
State Virginia state has the largest number of rescue/shelter.
found_new
dog_travel%>%
filter(!duplicated(id))%>%
freq_count(found_new,5)
## # A tibble: 5 x 2
## found_new freq_sum
## <chr> <int>
## 1 Texas 633
## 2 Alabama 325
## 3 Georgia 196
## 4 North Carolina 189
## 5 Virginia 153
State Texas has the biggest output: 633 dogs were found/rescued in Texas.
#contact_state
travel <- dog_travel %>%
filter(contact_state%in%state_abrv)%>%
freq_count(contact_state,51)%>%
mutate(state=contact_state)%>%
data.frame()
travel$freq_sum <- as.numeric(travel$freq_sum)
# found: join the dog_travel with state_codes
found <- dog_travel %>%
left_join(state_codes, by = c("found_new" = "state"))%>%
mutate(state_new=abbrev)%>%
freq_count(state_new,51)%>%
na.omit()%>%
mutate(state=state_new)%>%
data.frame()
found$freq_sum <- as.numeric(found$freq_sum)
#map- contact state
plot_usmap(data = travel,
values = 'freq_sum',
include=travel$state,
color = "red",
labels = TRUE) +
scale_fill_continuous(
low = "white", high = "red", name = "# of dogs",label = scales::comma) +
labs(title = "US States", subtitle = "$contact_state: Number of dogs group by state")
#map- found
plot_usmap(data = found,
values = 'freq_sum',
include=found$state,
color = "blue",
labels = TRUE) +
scale_fill_continuous(
low = "white", high = "blue", name = "# of dogs",label = scales::comma) +
labs(title = "US States", subtitle = "$found: Number of dogs group by state")
description
fm <- dog_descriptions[-c(2:19,21:27,32:38)]
fm.zip <- geocode_zip(fm$contact_zip)
names(fm)[5] <- "zipcode"
names(fm)[5]
## [1] "zipcode"
fm<- merge(fm.zip, fm, by='zipcode')
us <- c(left = -125, bottom = 25.75, right = -67, top = 49)
get_stamenmap(us, zoom = 5, maptype = "toner-lite") %>% ggmap()
state <- subset(fm, fm$contact_state == "OH")
qmplot(lng, lat, data = fm, maptype = "toner-lite", color = I("blue"))
qmplot(lng, lat, data = state, maptype = "toner-lite", color = I("blue"))
We built a word cloud on variabledescription as it’s mainly text and wishing to gain more information from the text mining and the most frequent word.
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
remove and still_there
#remove
remove_found <- dog_travel %>%
left_join(state_codes, by = c("found" = "state"))%>%
mutate(state=abbrev)%>%
na.omit()
#remove
plota<- ggplot(remove_found,aes(state, fill=factor(remove)))+
geom_bar()+
scale_y_continuous(name = "Count", labels =scales::comma)+
coord_flip()+
ggtitle("Remove by states")
#still_there
plotb<- ggplot(remove_found,aes(state, fill=factor(still_there)))+
geom_bar()+
scale_y_continuous(name = "Count", labels =scales::comma)+
coord_flip()+
ggtitle("Still_there by states")
grid.arrange(plota, plotb, ncol=2, nrow =1)
We intend to used bar chart/ pie chart on categorical variables eg, breed, color, sex,color,coat. We also intend to use parallel coordinates on variables foundand contact state/city.
On top of that, we intend to gain new information by slicing and grouping by the state, we can get more summarized information by state and then look into information within each state.
For not mixed breed, these are the breed with most counted number within each state:
#for not mixed breed most number of breed in each state
breed<- dog_descriptions%>%
filter(breed_mixed=="FALSE")%>%
group_by(contact_state,breed_primary)%>%
summarise(breed_sum=n())%>%
arrange(contact_state, desc(breed_sum))%>%
slice(1)
table(breed$breed_primary)
##
## American Staffordshire Terrier Australian Shepherd
## 1 2
## Boxer Chihuahua
## 1 5
## Coonhound German Shepherd Dog
## 1 3
## Labrador Retriever Mixed Breed
## 5 8
## Pit Bull Terrier
## 25
We can see that as not mixed breed, Pit Bull Terrier as top 1 breed in 25 states.
For mixed breed, these are the breed_primary with most counted number within each state:
#for mixed breed most number of breed in each state
mixed<-dog_descriptions%>%
filter(breed_mixed=="TRUE")%>%
group_by(contact_state,breed_primary)%>%
summarise(breed_mix_sum=n())%>%
arrange(contact_state, desc(breed_mix_sum))%>%
slice(1)
table(mixed$breed_primary)
##
## Australian Shepherd Border Collie Chihuahua Golden Retriever
## 1 1 1 1
## Husky Labrador Retriever Mixed Breed Pit Bull Terrier
## 1 29 4 13
## Shar-Pei Shepherd
## 1 1
We can see that for Labrador Retriever as breed_primary has the most number: 29 states, followed by Pit Bull Terrier (13 states).
We will continue with EDA by exploring and visualizing the variables between each data sets.
The original dog_move dataset is derived from cleaned 2,460 rows of Travel dataset. Each row represents a single dog that was available for adoption somewhere in the US. Each of these dogs is described as having been moved from another location to their current location.
sum(dog_moves$exported)
## [1] 2460
sum(dog_moves$imported)
## [1] 2460
moves_sub<- dog_moves%>%select("location","country")
subset of dog_travel in us with distinct id (one row of record).
c= moves_sub$location
travel_moves<- dog_travel%>%
left_join(moves_sub, by=c("found_new"="location"))%>%
distinct(id, .keep_all = TRUE)%>%
filter(country== "US")%>%
group_by(found_new)%>%
summarise(n=n())
travel_contact<- dog_travel%>%
left_join(moves_sub, by=c("found_new"="location"))%>%
distinct(id, .keep_all = TRUE)%>%
filter(country== "US")%>%
group_by(contact_state)%>%
summarise(n=n())
#exported = travel_moves - travel_contact
# #most_exported <- dog_travel%>%
# distinct(id, .keep_all = TRUE)%>%
# count(found_new, sort = TRUE)
#
# #most_imported <- dog_travel %>%
# distinct(id, .keep_all = TRUE)%>%
# count(contact_state, sort = TRUE)
#subset of dog_travel in us and distinct id
travel_sub_1<-dog_travel%>%
filter(found_new%in%state_fulname)%>%
group_by(id)%>%
distinct(id, .keep_all = TRUE)%>%
arrange(id)
we can notice that: the sum count of each found(excluding the number from their states, i.e contact_state=found) is approximately the summary number in dog_moves.dataset.
Note: we could not find the source of which 2460 rows of the original travel dataset were selected from the description dataset. Therefore, we computed the sub_total of each location using the dog_travel.csv dataset.
Connect travel and description data, create a new variable called origin_bi: a binary variable, assign 1 if the id is in travel datasets (meaning these dogs whose description indicates that they did not originate in the state where they were made available for adoption ).
travel_id= travel_sub_1$id
dog_descriptions<- dog_descriptions%>%
mutate(origin_bi=ifelse(id%in%travel_id,1,0))
# combine_sum%>%
# filter(origin_bi==1)%>%
# arrange(desc(n))
combine_sum<- dog_descriptions%>%
group_by(contact_state, origin_bi)%>%
summarise(n=n())%>%
mutate(freq = formattable::percent(n/sum(n)))%>%
arrange(desc(freq))%>%
mutate(origin_bi_detail=ifelse(origin_bi==1, "imported", "exported"))
## `summarise()` has grouped output by 'contact_state'. You can override using the `.groups` argument.
ggplot(data=combine_sum, aes(x=contact_state,y=freq, fill=factor(origin_bi_detail)))+
geom_bar(position="stack", stat="identity")+
scale_y_continuous(name = "Percentage", labels =scales::percent)+
coord_flip()+
ggtitle("Import dog percentage by states")
Washington (WA): 19.27% – has the highest import percentage of dogs. NY has the highest imported number of dogs: 351 (around 8.76%).
We left joined description with travel and got a new dataset named: travel_detatil. Let’s explore an example in this dataset: There is a dog called Dahui, his id number is “44759409”. We can track his origin (China) and exported place (Maine).
travel_detail%>%
filter(id=="44759409")%>%
select(id, org_id,url, breed_primary, breed_secondary, age, sex, size,name, contact_city.x,found, found_new, description.x)
## # A tibble: 11 x 13
## id org_id url breed_primary breed_secondary age sex size name
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 2 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 3 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 4 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 5 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 6 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 7 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 8 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 9 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 10 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## 11 44759409 ME165 https:~ German Sheph~ Belgian Shephe~ Young Male Medi~ Da-H~
## # ... with 4 more variables: contact_city.x <chr>, found <chr>,
## # found_new <chr>, description.x <chr>
Da hui
We combined state population data with the descriptions data in order to better understand the number of adoptable dogs with respect to the state population. The state totals were summarized and used to calculate the number of dogs per 100,000 people for each of the states.
state <- read.csv("https://raw.github.com/rtkoenig/datawrangling_data/master/state_pop.csv")
state <- tibble(state)
# prepare the data for the hex map
dog_descriptions %>%
left_join(state, by = c("contact_state" = "abbrev") ) %>%
group_by(contact_state, State) %>%
summarise(num_dogs = n(), state_pop = max(Pop) ) %>%
mutate(dogs_per_100k = num_dogs / (state_pop / 100000)) %>%
select(state = State, contact_state, num_dogs, state_pop, dogs_per_100k) %>%
filter(!is.na(dogs_per_100k)) %>%
mutate(id = state, group = paste0(state, ".1")) %>%
select(id, state, contact_state, num_dogs, state_pop, dogs_per_100k,
#lat = latitude, long = longitude,
group ) -> dog_data
## `summarise()` has grouped output by 'contact_state'. You can override using the `.groups` argument.
Looking into the word cloud we generated, the most frequent words that have been mentioned are not very useful for our adoption decision evaluation. We could further explore the presence of certain types of words in certain types of dogs to recommend similar records to a user interacting with our Shiny app.
We decide to try with Frequent itemset mining (FPA).
From the example of Dahui, we can notice that we are empowered of summary and detailed information of adoptable dogs. The datasets is intended to provide as much trustworthy information to pet-finders as possible.
relationship
We explore three dataset: dog_move.csv, dog_travel.csv,dog_description.csv and found the relationships among the three dataset as well as more detailed information on the adopted dogs.
| Variable | Class | description |
|---|---|---|
location |
character | The full name of the US state or country |
exported |
double | The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location |
imported |
double | The number of adoptable dogs available in this state that originated in a different location |
total |
double | The total number of adoptable dogs availabe in a given state. |
inUS |
logical | Whether or not a location is in the US or not. Here, US territories will return FALSE |
| Variable | Class | description |
|---|---|---|
id |
double | The unique PetFinder identification number for each animal |
contact_city |
character | The rescue/shelter’s listed city |
contact_state |
character | The rescue/shelter’s listed State |
description |
character | The full description of each animal as entered by the rescue/shelter |
found |
character | Where the animal was found. |
manual |
character | . |
remove |
logical | Animal removed from location |
still_there |
logical TRUE/FALSE | - Whether the animal is still located in their origin location and will be transported to their final destination after adoption. |
| Variable | Class | description |
|---|---|---|
id |
double | The unique PetFinder identification number for each animal. |
org_id |
character | The unique identification number for each shelter or rescue. |
url |
character | The URL for each animal’s listing. |
species |
character | Species of animal. |
breed_primary |
character | The primary (assumed) breed assigned by the shelter or rescue. |
breed_secondary |
character | The secondary (assumed) breed assigned by the shelter or rescue. |
breed_mixed |
logical | Whether or not an animal is presumed to be mixed breed. |
breed_unknown |
logical | Whether or not the animal’s breed is completely unknown. |
color_primary |
character | The most prevalent color of an animal. |
color_secondary |
character | The second most prevalent color of an animal. |
color_tertiary |
character | The third most prevalent color of an animal. |
age |
character | The assumed age class of an animal (Baby, Young, Adult, or Senior). |
sex |
character | The sex of an animal (Female, Male, or Unknown). |
size |
character | The general size class of an animal (Small, Medium, Large, Extra Large). |
coat |
character | Coat Length for each animal (Curly, Hairless, Long, Medium, Short, Wire). |
fixed |
logical | Whether or not an animal has been spayed/neutered. |
house_trained |
logical | Whether or not an animal is trained to not go to the bathroom in the house. |
declawed |
logical | Whether or not the animal has had its dewclaws removed. |
special_needs |
logical | Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care). |
shots_current |
logical | Whether or not the animal is up to date on all of their vaccines and other shots. |
env_children |
logical | Whether or not the animal is recommended for a home with children. |
env_dogs |
logical | Whether or not the animal is recommended for a home with other dogs. |
env_cats |
logical | Whether or not the animal is recommended for a home with cats. |
name |
character | The animal’s name (as given by the shelter/rescue). |
tags |
character | Any tags given to the dog by the shelter rescue (pipe |
photo |
character | The URL to the animal’s primary photo. |
status |
character | Whether the animal is adoptable or not. |
posted |
character | The date that this animal was first listed on PetFinder . |
contact_city |
character | The rescue/shelter’s listed city. |
contact_state |
character | The rescue/shelter’s listed state. |
contact_zip |
character | The rescue/shelter’s listed zip code. |
contact_country |
character | The rescue/shelter’s listed country. |
stateQ |
character | The state abbreviation queried in the API to return this result . |
accessed |
double | The date that this data was acquired from the PetFinder API. |
type |
character | The type of animal. |
description |
character | The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API. |