Data Wrangling Midterm

Introduction

Motivation

We suggest listening to this song as background music to read this section.

“Adopt don’t shop!” We both are big dog-lovers and are very fond of the idea of adopting pets to save them from a shelter. Kelsey has a former shelter dog as a member of her family, and Jeff is considering adopting his own dog soon. With this in mind, we realized the importance of understanding what dogs are available in shelters and making that information readily available to prospective dog owners. With an easier process and the understanding that dogs that meet their specifications exist in shelters, the likelihood of shelter adoptions increases. With approximately 3.3 million dogs entering the shelter system each year in the US, it’s important to encourage and enable shelter adoption wherever possible.

Data and Methodology

In this project, we plan to use summary statistics and data exploration in R to discover trends and insights on location and availability of adoptable pets. This analysis will allow a better understanding of the pet availability. We will be able to create visualizations on this data as well.

The Petfinder dataset includes data on all adoptable dogs in the US on Petfinder.com as of 9/20/2019. The full Petfinder dataset consists of three unique datasets, listed below:

Dog Descriptions (dog_descriptions)
Dog Locations (dog_travel)
Dog Moves (dog_moves)

Additional details about the three datasets can be located in the Data Preparation tab.

Planned Analysis

With the chosen data, there are many options for analysis on the details and movement of dogs found on Petfinder. With our analysis, we wanted to utilize the data in a way that would help families find pets and also assist shelters in their process. The planned analysis can be found below:

Dog Demographics by State - This analysis will examine the typical demographics of dogs available in each state. This will include statistics such as breed, color, age, size, and other qualities commonly looked for in dog adoptions. This analysis will specifically focus on Ohio, but could be repeated for any other state.
Correlation Factors with Envy - This analysis will find the correlation between age, sex, and size when compared to envy of children, dogs, and cats. This will allow an examination of whether any of those factors tend to be correlated with envy that could lead to problems when adopted.
Import and Export Frequency - The final analysis will examine the origin of where dogs are coming from, and which states export the most and least dogs. This will be done with respect to Ohio, but could be repeated for any state.

Impact

We will consider the consumers of the analysis to be both animal shelters and those looking to adopt pets, as both can benefit from use of the analysis. The results of our dog demographics by state can be used by people looking for pets to help them understand what types of dogs they are likely to find available in their area. This can help set realistic expectations and goals for their search. The correlation between age, sex, and size, compared to envy of children, dogs, and cats, will also be helpful for hopeful pet owners. If they have a child, dog, or cat, they will know what types of dogs to filter their search to so that they find a good fit for their family. For dogs missing data on envy, this can be helpful for families to understand the general likelihood of the factor before spending time meeting the dog, which can make their process more efficient. Shelters could also use these results to preliminarily classify a dog and estimate whether it will get along with children and other animals. While they will still need to confirm the theory, these correlations could indicate what those responses might be so that they are prepared. Finally, the analysis examining the importation/exportation of dogs will be helpful to shelters. If they have a need to export a dog somewhere else, this data will give them a better understanding of which states typically accept transfers so that they can narrow their options and quickly find a solution.

Packages Used

As mentioned in the project description there are thousands of packages in R that allow for deeper analysis, more effective visualization of data, and cleaning or shaping datasets. The packages below carry much of the added functionality that we currently anticipate needing. Obviously, as we progress through our project, there will likely be more packages that we will add to the list.

Not all packages are built equally of course. Of the below packages, there are some that we will use more intensively and at different times in our analysis. For example, “tidyverse” is a package that will be helpful in formatting and cleaning our data before the analysis. “DT” and “dyplr” in “tidyverse” will help us to manipulate and shape the data as needed, while “ggplot2” will help us to visualize it.

A type of package that we currently have limited experience with, but think will be helpful as we progress through our project, are ones that allow for more in-depth formatting in publication of rmarkdown.

#library(tidyverse) #cleaning and manipulation of data
library(dplyr) #cleaning data
library(ggplot2) #visualization, plotting, and graphing of data  
library(rmarkdown) #create final report  
#library(tibble) #manipulation and visualization of data  
library(readr) #import data from excel files  
library(DT) #visualize data in table format

#Below packages are needed for mapping capabilities
library(OpenStreetMap) 
library(RColorBrewer) 
library(mapproj) 
library(sf) 
library(RgoogleMaps) 
library(scales) 
library(rworldmap) 
library(maps)
library(rnaturalearth)
library(rgeos)
library(ggspatial)
library(maptools)
library(leaflet)
library(sf)
library(tmap)
library(here)
library(rgdal)
library(scales)

Data Preparation

Source

Data selected for this project originally came from our professor, Alexandros Paparas amongst four other datasets. The data is from a GitHub. The GitHub, credits Amber Thomas and Sacha Maxim with scraping the data for their article, Finding Forever Homes from Petfinder.

As mentioned above, the data was initially used by Thomas and Maxim to write an article and create an analysis breaking down dog adoptions by state and breed. Before that even, the data is pulled from Petfinder which compiles information on dog adoptions from shelters across the United States.

Overview of the Data

The data includes all dogs listed on Petfinder on September 20, 2019. The dataset contains three separate tables:

dog_descriptions - contains 36 variables that list information on the dog
dog_travel - contains 8 variables that describe which state the dog is from and where it currently is
dog_moves - contains 5 variables that compile information from dog_travel into a table by state

dog_descriptions and dog_travel are linked by a variable called “id” that serves as a UNID.

Overall, missing data is denoted by NA. There are some variables, such as “declawed”, which are composed of records that are entirely NA. This variable is likely meant for cats and is a variable that can be either dropped or ignored. Two other variables that are essentially unnecessary are “species” and “type” as the dataset is composed entirely of dogs.

Lastly, “name” alternates between capital and lowercase letters. This is a peculiarity, but something that does not affect our analysis.

Data Importing and Cleaning

Our first step in this step was to set the working directory and import the three tables.

setwd("C:/Users/bogen/OneDrive - University of Cincinnati/Spring 2021/Data Wrangling/Project") #sets working directory
#setwd("C:/Users/Kelsey/Desktop/MS BANA/Data Wrangling/Midterm/Petfinder") #Kelsey's wd
#setwd() #user's wd

dog_descriptions <- read_csv('dog_descriptions.csv') #imports dog_descriptions table
dog_travel <- read_csv('dog_travel.csv') #imports dog_travel table
dog_moves <- read_csv('dog_moves.csv') #imports dog_moves table

dog_descriptions Table

Attacking one table at a time, we then moved to the dog_description table. In the dog_description table, we re-formatted the posted variable from time to just day. Next we corrected all variable names to the same format, changing “stateQ” to “state_q”.

head(dog_descriptions, 10) #shows first 10 lines

## # A tibble: 10 x 36
##        id org_id url   species breed_primary breed_secondary breed_mixed
##     <dbl> <chr>  <chr> <chr>   <chr>         <chr>           <lgl>      
##  1 4.60e7 NV163  http~ Dog     American Sta~ Mixed Breed     TRUE       
##  2 4.60e7 NV163  http~ Dog     Pit Bull Ter~ Mixed Breed     TRUE       
##  3 4.60e7 NV99   http~ Dog     Shepherd      NA              FALSE      
##  4 4.60e7 NV202  http~ Dog     German Sheph~ NA              FALSE      
##  5 4.60e7 NV184  http~ Dog     Dachshund     NA              FALSE      
##  6 4.60e7 NV184  http~ Dog     Boxer         Beagle          TRUE       
##  7 4.60e7 NV184  http~ Dog     Italian Grey~ Chihuahua       TRUE       
##  8 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
##  9 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
## 10 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
## # ... with 29 more variables: breed_unknown <lgl>, color_primary <chr>,
## #   color_secondary <chr>, color_tertiary <chr>, age <chr>, sex <chr>,
## #   size <chr>, coat <chr>, fixed <lgl>, house_trained <lgl>, declawed <lgl>,
## #   special_needs <lgl>, shots_current <lgl>, env_children <lgl>,
## #   env_dogs <lgl>, env_cats <lgl>, name <chr>, tags <chr>, photo <chr>,
## #   status <chr>, posted <chr>, contact_city <chr>, contact_state <chr>,
## #   contact_zip <chr>, contact_country <chr>, stateQ <chr>, accessed <chr>,
## #   type <chr>, description <chr>

dog_descriptions$posted <- strtrim(dog_descriptions$posted, 10) #trim "posted" variable
dog_descriptions$posted <- as.Date(dog_descriptions$posted) #re-format "posted" variable from time and day to just day

rename(dog_descriptions, state_q = stateQ) #changed stateQ to state_q

## # A tibble: 58,180 x 36
##        id org_id url   species breed_primary breed_secondary breed_mixed
##     <dbl> <chr>  <chr> <chr>   <chr>         <chr>           <lgl>      
##  1 4.60e7 NV163  http~ Dog     American Sta~ Mixed Breed     TRUE       
##  2 4.60e7 NV163  http~ Dog     Pit Bull Ter~ Mixed Breed     TRUE       
##  3 4.60e7 NV99   http~ Dog     Shepherd      NA              FALSE      
##  4 4.60e7 NV202  http~ Dog     German Sheph~ NA              FALSE      
##  5 4.60e7 NV184  http~ Dog     Dachshund     NA              FALSE      
##  6 4.60e7 NV184  http~ Dog     Boxer         Beagle          TRUE       
##  7 4.60e7 NV184  http~ Dog     Italian Grey~ Chihuahua       TRUE       
##  8 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
##  9 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
## 10 4.60e7 NV184  http~ Dog     Cattle Dog    NA              TRUE       
## # ... with 58,170 more rows, and 29 more variables: breed_unknown <lgl>,
## #   color_primary <chr>, color_secondary <chr>, color_tertiary <chr>,
## #   age <chr>, sex <chr>, size <chr>, coat <chr>, fixed <lgl>,
## #   house_trained <lgl>, declawed <lgl>, special_needs <lgl>,
## #   shots_current <lgl>, env_children <lgl>, env_dogs <lgl>, env_cats <lgl>,
## #   name <chr>, tags <chr>, photo <chr>, status <chr>, posted <date>,
## #   contact_city <chr>, contact_state <chr>, contact_zip <chr>,
## #   contact_country <chr>, state_q <chr>, accessed <chr>, type <chr>,
## #   description <chr>

After this, since the table contains 34 variables, we removed the variables that are not used in the analysis to keep our dataset succinct and easy to work with. We removed the following variables:

“url” - not included in analysis
“species” - every observation is ‘Dog’
“color_primary”, “color_secondary”, “color_tertiary” - unused in analysis
“coat” - unused in analysis
“fixed” - unused in analysis
“house_trained” - unused in analysis
“declawed” - all NAs
“special_needs” - unused in analysis
“shots_current” - unused in analysis
“tags” - unused in analysis
“photo” - unused in analysis
“status” - unused in analysis
“state_q” - unused in analysis
“accessed” - unused in analysis
“type” - every observation is ‘Dog’
“description” - unused in analysis

dog_descriptions <- dog_descriptions[,-c(3,4,9:11,15:20,25:27,33:36)] #removing unused variables from above

dog_travel Table

For the dog_travel table, we imputed the values from “found” into the NA observations for “manual”. This is because in the event that a city was listed for “found”, the state (or country if outside of the US) was moved to the “manual” column. In imputing, we made the most accurate “found” variable by state that we could. As such, after imputing, we changed the name to “found_state”.

head(dog_travel, 10) #shows first 10 lines

## # A tibble: 10 x 8
##        id contact_city contact_state description found manual remove still_there
##     <dbl> <chr>        <chr>         <chr>       <chr> <chr>  <lgl>  <lgl>      
##  1 4.45e7 Anoka        MN            "Boris is ~ Arka~ NA     NA     NA         
##  2 4.47e7 Groveland    FL            "Duke is a~ Abac~ Baham~ NA     NA         
##  3 4.60e7 Adamstown    MD            "Zac Woof-~ Adam  Maryl~ NA     NA         
##  4 4.45e7 Saint Cloud  MN            "~~Came in~ Adap~ NA     TRUE   NA         
##  5 4.39e7 Pueblo       CO            "Palang is~ Afgh~ NA     NA     NA         
##  6 4.31e7 Manchester   CT            "Brooke ha~ Afgh~ NA     NA     NA         
##  7 4.53e7 Wooster      OH            "Tate is a~ Akron Ohio   NA     NA         
##  8 4.53e7 Wooster      OH            "Tate is a~ Akron Ohio   NA     NA         
##  9 4.60e7 Locust Fork  AL            "Meet Trix~ Alab~ NA     NA     NA         
## 10 4.59e7 Locust Fork  AL            "Meet Reba~ Alab~ NA     NA     NA

dog_travel$manual[is.na(dog_travel$manual)] <- dog_travel$found #imputing "found" for NAs in "manual"
dog_travel <- rename(dog_travel, found_state = manual) #from manual to "found_state"

This table also contained a series of unused or incorrect variables that was complicating our analysis. As such, we rmeoved the following variables:

“description” - unused in analysis and very long text strings
“found” - removed as this was essentially replaced in the above by “found_state”
“remove” - inaccurate observations
“still_there” - inaccurate observations

dog_travel <- dog_travel[,-c(4,5,7,8)] #removing unused variables from above

dog_moves Table

For the dog_moves table, we changed “inUS” to “in_US” for consistency in variable naming

head(dog_moves, 10) #shows first 10 lines

## # A tibble: 10 x 5
##    location       exported imported total inUS 
##    <chr>             <dbl>    <dbl> <dbl> <lgl>
##  1 Texas               635       NA   566 TRUE 
##  2 Alabama             268        2  1428 TRUE 
##  3 North Carolina      158       14  2627 TRUE 
##  4 South Carolina      139       12  1618 TRUE 
##  5 Georgia             137       19  3479 TRUE 
##  6 Puerto Rico         131       NA    NA FALSE
##  7 California          130        3  1664 TRUE 
##  8 South Korea          76       NA    NA FALSE
##  9 Tennessee            66       20  1769 TRUE 
## 10 Kentucky             57        4  1123 TRUE

dog_moves <- rename(dog_moves, in_US = inUS) #changed inUS to in_US

Linking Tables

In our approach, we cleaned our data before linking in order to have the opportunity to use the clean initial tables separately, adding some flexibility. Our analysis involves linking of the dog_descriptions and dog_travel tables though which is done below; however, for our planned analysis it is not necessary to link the dog_moves table to either of them.

dog_df <- left_join(dog_descriptions, dog_travel, by = ("id")) #joins "dog_descriptions" and "dog_travel"

NAs and True/Falses

Ultimately, we made the decision to not impute any variables for any of the NAs. For the quantitative data, there were no situations where we thought it made sense in terms of our analysis to do so. For example, we did not think it appropriate to impute a mean for dogs exported out of a state when there very well may have just been no dogs exported from that state. For the qualitative data we debated changing some NAs to Unknown; however, we again did not foresee that being beneficial to our analysis at this point in the process. We do however intend to use the na.omit function when applicable. For example, when analyzing the factors that might contribute to envious of cats, dogs, or children.

We also considered changing True’s and False’s to Yes’s and No’s. This also seemed like an extraneous detail though as we believe True and False convey the same message.

Cleaned Dataset

Combined Table

head(dog_df, 150)

## # A tibble: 150 x 21
##        id org_id breed_primary breed_secondary breed_mixed breed_unknown age  
##     <dbl> <chr>  <chr>         <chr>           <lgl>       <lgl>         <chr>
##  1 4.60e7 NV163  American Sta~ Mixed Breed     TRUE        FALSE         Seni~
##  2 4.60e7 NV163  Pit Bull Ter~ Mixed Breed     TRUE        FALSE         Adult
##  3 4.60e7 NV99   Shepherd      NA              FALSE       FALSE         Adult
##  4 4.60e7 NV202  German Sheph~ NA              FALSE       FALSE         Baby 
##  5 4.60e7 NV184  Dachshund     NA              FALSE       FALSE         Young
##  6 4.60e7 NV184  Boxer         Beagle          TRUE        FALSE         Baby 
##  7 4.60e7 NV184  Italian Grey~ Chihuahua       TRUE        FALSE         Baby 
##  8 4.60e7 NV184  Cattle Dog    NA              TRUE        FALSE         Baby 
##  9 4.60e7 NV184  Cattle Dog    NA              TRUE        FALSE         Baby 
## 10 4.60e7 NV184  Cattle Dog    NA              TRUE        FALSE         Baby 
## # ... with 140 more rows, and 14 more variables: sex <chr>, size <chr>,
## #   env_children <lgl>, env_dogs <lgl>, env_cats <lgl>, name <chr>,
## #   posted <date>, contact_city.x <chr>, contact_state.x <chr>,
## #   contact_zip <chr>, contact_country <chr>, contact_city.y <chr>,
## #   contact_state.y <chr>, found_state <chr>

It is worth noting that there are many more observations in the dog_descriptions than there were in the dog_travel table. This means that there are many NAs in the dog_df dataframe for the variables from the dog_travel table (“contact_city”, “contact_state”, “found_state”). We took the same approach to these NAs as mentioned previously.

dog_moves

head(dog_moves, 10)

## # A tibble: 10 x 5
##    location       exported imported total in_US
##    <chr>             <dbl>    <dbl> <dbl> <lgl>
##  1 Texas               635       NA   566 TRUE 
##  2 Alabama             268        2  1428 TRUE 
##  3 North Carolina      158       14  2627 TRUE 
##  4 South Carolina      139       12  1618 TRUE 
##  5 Georgia             137       19  3479 TRUE 
##  6 Puerto Rico         131       NA    NA FALSE
##  7 California          130        3  1664 TRUE 
##  8 South Korea          76       NA    NA FALSE
##  9 Tennessee            66       20  1769 TRUE 
## 10 Kentucky             57        4  1123 TRUE

Data Overview

Focusing on on our two “cleaned” tables, dog_df and dog_moves, we can see some basic analysis below.

For dog_df, we see that there are now 21 variables and 60,259 observations contained in our new dataframe.

dim(dog_df)

## [1] 60259    21

The structure of our data frame can be seen below with 1 numeric, 5 logical, and 15 character variables.

str(dog_df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    60259 obs. of  21 variables:
##  $ id             : num  4.6e+07 4.6e+07 4.6e+07 4.6e+07 4.6e+07 ...
##  $ org_id         : chr  "NV163" "NV163" "NV99" "NV202" ...
##  $ breed_primary  : chr  "American Staffordshire Terrier" "Pit Bull Terrier" "Shepherd" "German Shepherd Dog" ...
##  $ breed_secondary: chr  "Mixed Breed" "Mixed Breed" NA NA ...
##  $ breed_mixed    : logi  TRUE TRUE FALSE FALSE FALSE TRUE ...
##  $ breed_unknown  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ age            : chr  "Senior" "Adult" "Adult" "Baby" ...
##  $ sex            : chr  "Male" "Male" "Male" "Female" ...
##  $ size           : chr  "Medium" "Large" "Large" "Large" ...
##  $ env_children   : logi  NA NA NA NA TRUE TRUE ...
##  $ env_dogs       : logi  NA NA NA NA TRUE TRUE ...
##  $ env_cats       : logi  NA NA NA NA TRUE TRUE ...
##  $ name           : chr  "HARLEY" "BIGGIE" "Ziggy" "Gypsy" ...
##  $ posted         : Date, format: "2019-09-20" "2019-09-20" ...
##  $ contact_city.x : chr  "Las Vegas" "Las Vegas" "Mesquite" "Pahrump" ...
##  $ contact_state.x: chr  "NV" "NV" "NV" "NV" ...
##  $ contact_zip    : chr  "89147" "89147" "89027" "89048" ...
##  $ contact_country: chr  "US" "US" "US" "US" ...
##  $ contact_city.y : chr  NA NA NA NA ...
##  $ contact_state.y: chr  NA NA NA NA ...
##  $ found_state    : chr  NA NA NA NA ...

Focusing on our “env_…” as these are a key part of our proposed analysis, we can see the number of dogs who are envious of children, dogs, and cats below respectively.

summary(dog_df$env_children) #envious of children

##    Mode   FALSE    TRUE    NA's 
## logical    4558   24364   31337

summary(dog_df$env_dogs) #envious of dogs

##    Mode   FALSE    TRUE    NA's 
## logical    3623   32498   24138

summary(dog_df$env_cats) #envious of cats

##    Mode   FALSE    TRUE    NA's 
## logical    7062   12835   40362

Interestingly, dogs being envious of other dogs is the most common issue, followed by children, and then cats lastly.

For the dog_moves table, we can see that there are now 5 variables with 90 observations.

dim(dog_moves)

## [1] 90  5

Of these variables, 1 is character, 3 are numeric, and 1 is logical.

str(dog_moves)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 90 obs. of  5 variables:
##  $ location: chr  "Texas" "Alabama" "North Carolina" "South Carolina" ...
##  $ exported: num  635 268 158 139 137 131 130 76 66 57 ...
##  $ imported: num  NA 2 14 12 19 NA 3 NA 20 4 ...
##  $ total   : num  566 1428 2627 1618 3479 ...
##  $ in_US   : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   location = col_character(),
##   ..   exported = col_double(),
##   ..   imported = col_double(),
##   ..   total = col_double(),
##   ..   inUS = col_logical()
##   .. )

Proposed EDA

Discovery

For the uncovery of of new information in the data, we plan to utilize all three datasets in our analysis. Our examination of dog demographics by state will allow us to summarize variables by state and find distributions of the data. To look at each dog holistically, we plan to join the dog_description and dog_travel tables based on the unique identifier for each dog. This will allow us to find the following information for dogs in the state of Ohio:
* Breeds that are most available, based on counts of the breed_primary variable. * Typical age of available dogs, taken from the age variable.
* Distribution of dog size between small, medium, large, and extra large, which we will find in the size variable.
* The percentage of dogs that are mixed breeds or not, taken from the breed_mixed binary variable.
* Which shelters have the highest count of dogs available and would be the most beneficial to visit? This will come from the org_id variable.
* Cities with the most available dogs, from the contact_city variable. Are the above summary statistics similar for the city of Cincinnati versus the larger sample size of Ohio?

We also plan to analyze how different variables correlate with dogs’ envy of children, dogs, and cats. This will be done by creating a correlation matrix between age, sex, and size and env_children, env_dogs, and env_cats. We will also run a correlation test between each of the variables to find the specific correlation value.

Finally, we will analyze the movement of dogs based on the dog_travel table. This will involve an analysis to find any trends in movement, analyzing the frequency of certain origins for each state. We will also utilize a map in R to display the density of available dogs in each state with the OpenStreetMap package.

Plots and Tables

For the initial demographic analysis, plots and tables will be very prevalent in our results. For the count of breed type, age range, mixed breeds, shelter count, and city analysis, we plan to output tables showing the frequency of each response. We will also utilize barplots to visualize the frequency of each. A correlation matrix will be essential for our envy analysis in visualizing the correlation between the desired variables. Additionally, a geo-spatial map plot will be used to show frequency of dogs across the country.

Required Learning

Most of what is needed for our analysis has been covered in this course or in one of the other courses in R that we have already taken. The main learning that we will need to do before completing the project is how to work with geo-spatial data in R to create the desired map. We have never worked with maps in R, so we will need to learn how to properly utilize OpenStreetMap and other required packages to properly display the data how we want it on a map of the US.

Machine Learning Techniques

The incorporation of linear regression into our analysis will depend on preliminary results of the correlation values when we examine variables’ correlation with envy. If we find that there is strong correlation between any of the variables, that would prompt us to create a regression model to find more information on how envy is typically affected by other variables present.