We suggest listening to this song as background music to read this section.
“Adopt don’t shop!” We both are big dog-lovers and are very fond of the idea of adopting pets to save them from a shelter. Kelsey has a former shelter dog as a member of her family, and Jeff is considering adopting his own dog soon. With this in mind, we realized the importance of understanding what dogs are available in shelters and making that information readily available to prospective dog owners. With an easier process and the understanding that dogs that meet their specifications exist in shelters, the likelihood of shelter adoptions increases. With approximately 3.3 million dogs entering the shelter system each year in the US, it’s important to encourage and enable shelter adoption wherever possible.
In this project, we plan to use summary statistics and data exploration in R to discover trends and insights on location and availability of adoptable pets. This analysis will allow a better understanding of the pet availability. We will be able to create visualizations on this data as well.
The Petfinder dataset includes data on all adoptable dogs in the US on Petfinder.com as of 9/20/2019. The full Petfinder dataset consists of three unique datasets, listed below:
dog_descriptions
)dog_travel
)dog_moves
)Additional details about the three datasets can be located in the Data Preparation tab.
With the chosen data, there are many options for analysis on the details and movement of dogs found on Petfinder. With our analysis, we wanted to utilize the data in a way that would help families find pets and also assist shelters in their process. The planned analysis can be found below:
We will consider the consumers of the analysis to be both animal shelters and those looking to adopt pets, as both can benefit from use of the analysis. The results of our dog demographics by state can be used by people looking for pets to help them understand what types of dogs they are likely to find available in their area. This can help set realistic expectations and goals for their search. The correlation between age, sex, and size, compared to envy of children, dogs, and cats, will also be helpful for hopeful pet owners. If they have a child, dog, or cat, they will know what types of dogs to filter their search to so that they find a good fit for their family. For dogs missing data on envy, this can be helpful for families to understand the general likelihood of the factor before spending time meeting the dog, which can make their process more efficient. Shelters could also use these results to preliminarily classify a dog and estimate whether it will get along with children and other animals. While they will still need to confirm the theory, these correlations could indicate what those responses might be so that they are prepared. Finally, the analysis examining the importation/exportation of dogs will be helpful to shelters. If they have a need to export a dog somewhere else, this data will give them a better understanding of which states typically accept transfers so that they can narrow their options and quickly find a solution.
As mentioned in the project description there are thousands of packages in R that allow for deeper analysis, more effective visualization of data, and cleaning or shaping datasets. The packages below carry much of the added functionality that we currently anticipate needing. Obviously, as we progress through our project, there will likely be more packages that we will add to the list.
Not all packages are built equally of course. Of the below packages, there are some that we will use more intensively and at different times in our analysis. For example, “tidyverse” is a package that will be helpful in formatting and cleaning our data before the analysis. “DT” and “dyplr” in “tidyverse” will help us to manipulate and shape the data as needed, while “ggplot2” will help us to visualize it.
A type of package that we currently have limited experience with, but think will be helpful as we progress through our project, are ones that allow for more in-depth formatting in publication of rmarkdown.
#library(tidyverse) #cleaning and manipulation of data
library(dplyr) #cleaning data
library(ggplot2) #visualization, plotting, and graphing of data
library(rmarkdown) #create final report
#library(tibble) #manipulation and visualization of data
library(readr) #import data from excel files
library(DT) #visualize data in table format
#Below packages are needed for mapping capabilities
library(OpenStreetMap)
library(RColorBrewer)
library(mapproj)
library(sf)
library(RgoogleMaps)
library(scales)
library(rworldmap)
library(maps)
library(rnaturalearth)
library(rgeos)
library(ggspatial)
library(maptools)
library(leaflet)
library(sf)
library(tmap)
library(here)
library(rgdal)
library(scales)
Data selected for this project originally came from our professor, Alexandros Paparas amongst four other datasets. The data is from a GitHub. The GitHub, credits Amber Thomas and Sacha Maxim with scraping the data for their article, Finding Forever Homes from Petfinder.
As mentioned above, the data was initially used by Thomas and Maxim to write an article and create an analysis breaking down dog adoptions by state and breed. Before that even, the data is pulled from Petfinder which compiles information on dog adoptions from shelters across the United States.
The data includes all dogs listed on Petfinder on September 20, 2019. The dataset contains three separate tables:
dog_descriptions and dog_travel are linked by a variable called “id” that serves as a UNID.
Overall, missing data is denoted by NA. There are some variables, such as “declawed”, which are composed of records that are entirely NA. This variable is likely meant for cats and is a variable that can be either dropped or ignored. Two other variables that are essentially unnecessary are “species” and “type” as the dataset is composed entirely of dogs.
Lastly, “name” alternates between capital and lowercase letters. This is a peculiarity, but something that does not affect our analysis.
Our first step in this step was to set the working directory and import the three tables.
setwd("C:/Users/bogen/OneDrive - University of Cincinnati/Spring 2021/Data Wrangling/Project") #sets working directory
#setwd("C:/Users/Kelsey/Desktop/MS BANA/Data Wrangling/Midterm/Petfinder") #Kelsey's wd
#setwd() #user's wd
dog_descriptions <- read_csv('dog_descriptions.csv') #imports dog_descriptions table
dog_travel <- read_csv('dog_travel.csv') #imports dog_travel table
dog_moves <- read_csv('dog_moves.csv') #imports dog_moves table
Attacking one table at a time, we then moved to the dog_description table. In the dog_description table, we re-formatted the posted variable from time to just day. Next we corrected all variable names to the same format, changing “stateQ” to “state_q”.
head(dog_descriptions, 10) #shows first 10 lines
## # A tibble: 10 x 36
## id org_id url species breed_primary breed_secondary breed_mixed
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 4.60e7 NV163 http~ Dog American Sta~ Mixed Breed TRUE
## 2 4.60e7 NV163 http~ Dog Pit Bull Ter~ Mixed Breed TRUE
## 3 4.60e7 NV99 http~ Dog Shepherd NA FALSE
## 4 4.60e7 NV202 http~ Dog German Sheph~ NA FALSE
## 5 4.60e7 NV184 http~ Dog Dachshund NA FALSE
## 6 4.60e7 NV184 http~ Dog Boxer Beagle TRUE
## 7 4.60e7 NV184 http~ Dog Italian Grey~ Chihuahua TRUE
## 8 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## 9 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## 10 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## # ... with 29 more variables: breed_unknown <lgl>, color_primary <chr>,
## # color_secondary <chr>, color_tertiary <chr>, age <chr>, sex <chr>,
## # size <chr>, coat <chr>, fixed <lgl>, house_trained <lgl>, declawed <lgl>,
## # special_needs <lgl>, shots_current <lgl>, env_children <lgl>,
## # env_dogs <lgl>, env_cats <lgl>, name <chr>, tags <chr>, photo <chr>,
## # status <chr>, posted <chr>, contact_city <chr>, contact_state <chr>,
## # contact_zip <chr>, contact_country <chr>, stateQ <chr>, accessed <chr>,
## # type <chr>, description <chr>
dog_descriptions$posted <- strtrim(dog_descriptions$posted, 10) #trim "posted" variable
dog_descriptions$posted <- as.Date(dog_descriptions$posted) #re-format "posted" variable from time and day to just day
rename(dog_descriptions, state_q = stateQ) #changed stateQ to state_q
## # A tibble: 58,180 x 36
## id org_id url species breed_primary breed_secondary breed_mixed
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 4.60e7 NV163 http~ Dog American Sta~ Mixed Breed TRUE
## 2 4.60e7 NV163 http~ Dog Pit Bull Ter~ Mixed Breed TRUE
## 3 4.60e7 NV99 http~ Dog Shepherd NA FALSE
## 4 4.60e7 NV202 http~ Dog German Sheph~ NA FALSE
## 5 4.60e7 NV184 http~ Dog Dachshund NA FALSE
## 6 4.60e7 NV184 http~ Dog Boxer Beagle TRUE
## 7 4.60e7 NV184 http~ Dog Italian Grey~ Chihuahua TRUE
## 8 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## 9 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## 10 4.60e7 NV184 http~ Dog Cattle Dog NA TRUE
## # ... with 58,170 more rows, and 29 more variables: breed_unknown <lgl>,
## # color_primary <chr>, color_secondary <chr>, color_tertiary <chr>,
## # age <chr>, sex <chr>, size <chr>, coat <chr>, fixed <lgl>,
## # house_trained <lgl>, declawed <lgl>, special_needs <lgl>,
## # shots_current <lgl>, env_children <lgl>, env_dogs <lgl>, env_cats <lgl>,
## # name <chr>, tags <chr>, photo <chr>, status <chr>, posted <date>,
## # contact_city <chr>, contact_state <chr>, contact_zip <chr>,
## # contact_country <chr>, state_q <chr>, accessed <chr>, type <chr>,
## # description <chr>
After this, since the table contains 34 variables, we removed the variables that are not used in the analysis to keep our dataset succinct and easy to work with. We removed the following variables:
dog_descriptions <- dog_descriptions[,-c(3,4,9:11,15:20,25:27,33:36)] #removing unused variables from above
For the dog_travel table, we imputed the values from “found” into the NA observations for “manual”. This is because in the event that a city was listed for “found”, the state (or country if outside of the US) was moved to the “manual” column. In imputing, we made the most accurate “found” variable by state that we could. As such, after imputing, we changed the name to “found_state”.
head(dog_travel, 10) #shows first 10 lines
## # A tibble: 10 x 8
## id contact_city contact_state description found manual remove still_there
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl>
## 1 4.45e7 Anoka MN "Boris is ~ Arka~ NA NA NA
## 2 4.47e7 Groveland FL "Duke is a~ Abac~ Baham~ NA NA
## 3 4.60e7 Adamstown MD "Zac Woof-~ Adam Maryl~ NA NA
## 4 4.45e7 Saint Cloud MN "~~Came in~ Adap~ NA TRUE NA
## 5 4.39e7 Pueblo CO "Palang is~ Afgh~ NA NA NA
## 6 4.31e7 Manchester CT "Brooke ha~ Afgh~ NA NA NA
## 7 4.53e7 Wooster OH "Tate is a~ Akron Ohio NA NA
## 8 4.53e7 Wooster OH "Tate is a~ Akron Ohio NA NA
## 9 4.60e7 Locust Fork AL "Meet Trix~ Alab~ NA NA NA
## 10 4.59e7 Locust Fork AL "Meet Reba~ Alab~ NA NA NA
dog_travel$manual[is.na(dog_travel$manual)] <- dog_travel$found #imputing "found" for NAs in "manual"
dog_travel <- rename(dog_travel, found_state = manual) #from manual to "found_state"
This table also contained a series of unused or incorrect variables that was complicating our analysis. As such, we rmeoved the following variables:
dog_travel <- dog_travel[,-c(4,5,7,8)] #removing unused variables from above
For the dog_moves table, we changed “inUS” to “in_US” for consistency in variable naming
head(dog_moves, 10) #shows first 10 lines
## # A tibble: 10 x 5
## location exported imported total inUS
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Texas 635 NA 566 TRUE
## 2 Alabama 268 2 1428 TRUE
## 3 North Carolina 158 14 2627 TRUE
## 4 South Carolina 139 12 1618 TRUE
## 5 Georgia 137 19 3479 TRUE
## 6 Puerto Rico 131 NA NA FALSE
## 7 California 130 3 1664 TRUE
## 8 South Korea 76 NA NA FALSE
## 9 Tennessee 66 20 1769 TRUE
## 10 Kentucky 57 4 1123 TRUE
dog_moves <- rename(dog_moves, in_US = inUS) #changed inUS to in_US
In our approach, we cleaned our data before linking in order to have the opportunity to use the clean initial tables separately, adding some flexibility. Our analysis involves linking of the dog_descriptions and dog_travel tables though which is done below; however, for our planned analysis it is not necessary to link the dog_moves table to either of them.
dog_df <- left_join(dog_descriptions, dog_travel, by = ("id")) #joins "dog_descriptions" and "dog_travel"
Ultimately, we made the decision to not impute any variables for any of the NAs. For the quantitative data, there were no situations where we thought it made sense in terms of our analysis to do so. For example, we did not think it appropriate to impute a mean for dogs exported out of a state when there very well may have just been no dogs exported from that state. For the qualitative data we debated changing some NAs to Unknown; however, we again did not foresee that being beneficial to our analysis at this point in the process. We do however intend to use the na.omit function when applicable. For example, when analyzing the factors that might contribute to envious of cats, dogs, or children.
We also considered changing True’s and False’s to Yes’s and No’s. This also seemed like an extraneous detail though as we believe True and False convey the same message.
head(dog_df, 150)
## # A tibble: 150 x 21
## id org_id breed_primary breed_secondary breed_mixed breed_unknown age
## <dbl> <chr> <chr> <chr> <lgl> <lgl> <chr>
## 1 4.60e7 NV163 American Sta~ Mixed Breed TRUE FALSE Seni~
## 2 4.60e7 NV163 Pit Bull Ter~ Mixed Breed TRUE FALSE Adult
## 3 4.60e7 NV99 Shepherd NA FALSE FALSE Adult
## 4 4.60e7 NV202 German Sheph~ NA FALSE FALSE Baby
## 5 4.60e7 NV184 Dachshund NA FALSE FALSE Young
## 6 4.60e7 NV184 Boxer Beagle TRUE FALSE Baby
## 7 4.60e7 NV184 Italian Grey~ Chihuahua TRUE FALSE Baby
## 8 4.60e7 NV184 Cattle Dog NA TRUE FALSE Baby
## 9 4.60e7 NV184 Cattle Dog NA TRUE FALSE Baby
## 10 4.60e7 NV184 Cattle Dog NA TRUE FALSE Baby
## # ... with 140 more rows, and 14 more variables: sex <chr>, size <chr>,
## # env_children <lgl>, env_dogs <lgl>, env_cats <lgl>, name <chr>,
## # posted <date>, contact_city.x <chr>, contact_state.x <chr>,
## # contact_zip <chr>, contact_country <chr>, contact_city.y <chr>,
## # contact_state.y <chr>, found_state <chr>
It is worth noting that there are many more observations in the dog_descriptions than there were in the dog_travel table. This means that there are many NAs in the dog_df dataframe for the variables from the dog_travel table (“contact_city”, “contact_state”, “found_state”). We took the same approach to these NAs as mentioned previously.
head(dog_moves, 10)
## # A tibble: 10 x 5
## location exported imported total in_US
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Texas 635 NA 566 TRUE
## 2 Alabama 268 2 1428 TRUE
## 3 North Carolina 158 14 2627 TRUE
## 4 South Carolina 139 12 1618 TRUE
## 5 Georgia 137 19 3479 TRUE
## 6 Puerto Rico 131 NA NA FALSE
## 7 California 130 3 1664 TRUE
## 8 South Korea 76 NA NA FALSE
## 9 Tennessee 66 20 1769 TRUE
## 10 Kentucky 57 4 1123 TRUE
Focusing on on our two “cleaned” tables, dog_df and dog_moves, we can see some basic analysis below.
For dog_df, we see that there are now 21 variables and 60,259 observations contained in our new dataframe.
dim(dog_df)
## [1] 60259 21
The structure of our data frame can be seen below with 1 numeric, 5 logical, and 15 character variables.
str(dog_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 60259 obs. of 21 variables:
## $ id : num 4.6e+07 4.6e+07 4.6e+07 4.6e+07 4.6e+07 ...
## $ org_id : chr "NV163" "NV163" "NV99" "NV202" ...
## $ breed_primary : chr "American Staffordshire Terrier" "Pit Bull Terrier" "Shepherd" "German Shepherd Dog" ...
## $ breed_secondary: chr "Mixed Breed" "Mixed Breed" NA NA ...
## $ breed_mixed : logi TRUE TRUE FALSE FALSE FALSE TRUE ...
## $ breed_unknown : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ age : chr "Senior" "Adult" "Adult" "Baby" ...
## $ sex : chr "Male" "Male" "Male" "Female" ...
## $ size : chr "Medium" "Large" "Large" "Large" ...
## $ env_children : logi NA NA NA NA TRUE TRUE ...
## $ env_dogs : logi NA NA NA NA TRUE TRUE ...
## $ env_cats : logi NA NA NA NA TRUE TRUE ...
## $ name : chr "HARLEY" "BIGGIE" "Ziggy" "Gypsy" ...
## $ posted : Date, format: "2019-09-20" "2019-09-20" ...
## $ contact_city.x : chr "Las Vegas" "Las Vegas" "Mesquite" "Pahrump" ...
## $ contact_state.x: chr "NV" "NV" "NV" "NV" ...
## $ contact_zip : chr "89147" "89147" "89027" "89048" ...
## $ contact_country: chr "US" "US" "US" "US" ...
## $ contact_city.y : chr NA NA NA NA ...
## $ contact_state.y: chr NA NA NA NA ...
## $ found_state : chr NA NA NA NA ...
Focusing on our “env_…” as these are a key part of our proposed analysis, we can see the number of dogs who are envious of children, dogs, and cats below respectively.
summary(dog_df$env_children) #envious of children
## Mode FALSE TRUE NA's
## logical 4558 24364 31337
summary(dog_df$env_dogs) #envious of dogs
## Mode FALSE TRUE NA's
## logical 3623 32498 24138
summary(dog_df$env_cats) #envious of cats
## Mode FALSE TRUE NA's
## logical 7062 12835 40362
Interestingly, dogs being envious of other dogs is the most common issue, followed by children, and then cats lastly.
For the dog_moves table, we can see that there are now 5 variables with 90 observations.
dim(dog_moves)
## [1] 90 5
Of these variables, 1 is character, 3 are numeric, and 1 is logical.
str(dog_moves)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 90 obs. of 5 variables:
## $ location: chr "Texas" "Alabama" "North Carolina" "South Carolina" ...
## $ exported: num 635 268 158 139 137 131 130 76 66 57 ...
## $ imported: num NA 2 14 12 19 NA 3 NA 20 4 ...
## $ total : num 566 1428 2627 1618 3479 ...
## $ in_US : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. location = col_character(),
## .. exported = col_double(),
## .. imported = col_double(),
## .. total = col_double(),
## .. inUS = col_logical()
## .. )
For the uncovery of of new information in the data, we plan to utilize all three datasets in our analysis. Our examination of dog demographics by state will allow us to summarize variables by state and find distributions of the data. To look at each dog holistically, we plan to join the dog_description
and dog_travel
tables based on the unique identifier for each dog. This will allow us to find the following information for dogs in the state of Ohio:
* Breeds that are most available, based on counts of the breed_primary variable. * Typical age of available dogs, taken from the age variable.
* Distribution of dog size between small, medium, large, and extra large, which we will find in the size variable.
* The percentage of dogs that are mixed breeds or not, taken from the breed_mixed binary variable.
* Which shelters have the highest count of dogs available and would be the most beneficial to visit? This will come from the org_id variable.
* Cities with the most available dogs, from the contact_city variable. Are the above summary statistics similar for the city of Cincinnati versus the larger sample size of Ohio?
We also plan to analyze how different variables correlate with dogs’ envy of children, dogs, and cats. This will be done by creating a correlation matrix between age, sex, and size and env_children, env_dogs, and env_cats. We will also run a correlation test between each of the variables to find the specific correlation value.
Finally, we will analyze the movement of dogs based on the dog_travel
table. This will involve an analysis to find any trends in movement, analyzing the frequency of certain origins for each state. We will also utilize a map in R to display the density of available dogs in each state with the OpenStreetMap package.
For the initial demographic analysis, plots and tables will be very prevalent in our results. For the count of breed type, age range, mixed breeds, shelter count, and city analysis, we plan to output tables showing the frequency of each response. We will also utilize barplots to visualize the frequency of each. A correlation matrix will be essential for our envy analysis in visualizing the correlation between the desired variables. Additionally, a geo-spatial map plot will be used to show frequency of dogs across the country.
Most of what is needed for our analysis has been covered in this course or in one of the other courses in R that we have already taken. The main learning that we will need to do before completing the project is how to work with geo-spatial data in R to create the desired map. We have never worked with maps in R, so we will need to learn how to properly utilize OpenStreetMap and other required packages to properly display the data how we want it on a map of the US.
The incorporation of linear regression into our analysis will depend on preliminary results of the correlation values when we examine variables’ correlation with envy. If we find that there is strong correlation between any of the variables, that would prompt us to create a regression model to find more information on how envy is typically affected by other variables present.