Introduction

In this report, we explore Petfinder data. Petfinder is an online database of animals who need homes. Using the PetFinder API and R, we downloaded data available for the city of Los Angeles, CA by April 2023. Using data wrangling techniques and visualization, we show that:

(Include the key three stories about your data here)

A glimpse of the dataset is shown below:

library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate) # to deal with date and time data
library(kableExtra) # Nicely-formatted tables (you don't need this; it's just trying this option)
dogs <- read_csv(here("data","dogs_LA_data.csv.csv")) # dogs looking for forever homes in LA (via Petfinder.com)

head(dogs)
## # A tibble: 6 x 49
##         id organization_id url      type  species age   gender size  coat  tags 
##      <dbl> <chr>           <chr>    <chr> <chr>   <chr> <chr>  <chr> <chr> <chr>
## 1 62679064 CA448           https:/~ Dog   Dog     Young Female Medi~ <NA>  "cha~
## 2 62679056 CA1763          https:/~ Cat   Cat     Young Female Medi~ <NA>  "cha~
## 3 62679054 CA190           https:/~ Cat   Cat     Baby  Female Medi~ <NA>  "cha~
## 4 62679044 CA2713          https:/~ Cat   Cat     Baby  Female Medi~ Short "c(\~
## 5 62679023 CA2713          https:/~ Dog   Dog     Baby  Female Small Medi~ "c(\~
## 6 62679007 CA2713          https:/~ Dog   Dog     Adult Male   Small Short "c(\~
## # i 39 more variables: name <chr>, description <chr>,
## #   organization_animal_id <chr>, photos <chr>, videos <chr>, status <chr>,
## #   status_changed_at <dttm>, published_at <dttm>, distance <dbl>,
## #   breeds.primary <chr>, breeds.secondary <chr>, breeds.mixed <lgl>,
## #   breeds.unknown <lgl>, colors.primary <chr>, colors.secondary <chr>,
## #   colors.tertiary <chr>, attributes.spayed_neutered <lgl>,
## #   attributes.house_trained <lgl>, attributes.declawed <lgl>, ...

Data Preparation

Before conducting any analysis, data processing was performed. In this stage, the dataset was cleaned and prepared for easy manipulation.

tidydogs <- dogs %>% 
  filter(type == "Dog") %>% 
  clean_names() %>% # Careful with the autopredict in the next lines, because column names might change
  rename(house_trained = attributes_house_trained,
         spayed_neutered = attributes_spayed_neutered,
         declawed = attributes_declawed,
         spatial_needs = attributes_special_needs) %>% 
  select(id, age, gender, type, species, name, breeds_primary, distance, published_at, status, description, contact_address_address1, contact_address_city)

names(tidydogs) # list the new column names
##  [1] "id"                       "age"                     
##  [3] "gender"                   "type"                    
##  [5] "species"                  "name"                    
##  [7] "breeds_primary"           "distance"                
##  [9] "published_at"             "status"                  
## [11] "description"              "contact_address_address1"
## [13] "contact_address_city"

Since the column names were already in good shape, it just changed a few (.) to (_). The original dataset has 49 columns with information about the dogs, and only 14 columns were selected for this study.

Analysis

As part of the data exploration, I plotted the total number of dogs looking for forever home by age and gender, but only including my favorite breeds. I started by creating that list.

#Create a list of favorite/known breeds
fav_breeds <- c("Labrador Retriever", 
                "Husky", 
                "Shih Tzu", 
                "Belgian Shepherd / Malinois", 
                "Siberian Husky", 
                "Dutch Shepherd", 
                "Golden Retriever", 
                "Belgian Shepherd / Laekenois", 
                "German Shepherd Dog", "Shepherd", "Dachshund", 
                "Mixed Breed", 
                "Australian Shepherd", 
                "Akita")

For plotting, I use the ggplot package contained in tidyverse and the geom_count function that is used for discrete variables, and it creates a bubbleplot. See more info here. geom_count creates a new dataframe on the basis of the dataframe ‘tidydogs.’ As stated in the manual: “A way of using the variables that geom_count has created, is by referring to them with the “..” before, and after. By default, the counts will be used (..count..), but we can also refer to the proportions in the created dataframe by using: size=..prop... (we specify group=class to signal to geom_count that we want the proportions within class) Below you can see the dataframe that was created through geom_count

#create plot using proportions by breed
tidydogs %>% 
  filter(breeds_primary %in% fav_breeds) %>% # using only the breeds_primary attribute for this selection
  ggplot(aes(x = age,  y = breeds_primary, color = gender)) +
  geom_count(aes(size=..prop.., group = breeds_primary)) + 
  scale_size_area() + # ensures that counts of zero would be given size 0--learned this reading the documentation for geom_count :)
  facet_wrap(~ gender)

I compared these results with the ones obtained while using the default counts instead of prop (proportions)

#create plot using total counts
tidydogs %>% 
  filter(breeds_primary %in% fav_breeds) %>% 
  ggplot(aes(x = age,  y = breeds_primary, color = gender)) +
  geom_count(aes(group = breeds_primary)) + 
  scale_size_area() + # ensures that counts of zero would be given size 0--learned this reading the documentation for geom_count :)
  facet_wrap(~ gender)
Figure 2. Total number of dogs by breed and age. The plot shows only 13 of my favorite dogs breeds. From all breeds, female mixed-breed puppies makes the highest population with 30 dogs looking for forever home, followed by the group of young male huskies

Figure 2. Total number of dogs by breed and age. The plot shows only 13 of my favorite dogs breeds. From all breeds, female mixed-breed puppies makes the highest population with 30 dogs looking for forever home, followed by the group of young male huskies

Then, I compared the results with the actual data, by doing some data wrangling as used in lesson 2.

tidydogs %>% 
  filter(breeds_primary %in% fav_breeds) %>% 
  group_by(breeds_primary, age, gender) %>% 
  count() %>% 
  rename(count = n) %>% 
  arrange(-count) %>%
  head() %>%
  #You don't NEED the following two lines. They are just to give the table a better format in the html. See more ideas here: https://bookdown.org/yihui/rmarkdown-cookbook/kableextra.html
  kable(format = "html") %>% 
  kable_styling(font_size = 11)
breeds_primary age gender count
Mixed Breed Baby Female 30
Husky Young Male 29
Labrador Retriever Young Male 20
German Shepherd Dog Young Male 16
German Shepherd Dog Adult Female 13
German Shepherd Dog Baby Male 13

Based on these results, both plots are helpful for decision-making. The second plot that uses count and not proportions is helpful to select across breeds, and the first plot would be useful for making the decision while comparing results just within a specific breed but still having a whole picture of the options.

tidydogs %>% 
  na.omit() %>% 
  mutate(logdistance = log(distance)) %>% 
  ggplot(aes(x = breeds_primary, y = logdistance, fill = age)) +
  geom_boxplot() +
  coord_flip()

datadownload_time <- "2023-04-30 22:15:00"
ref_time <- as_datetime(datadownload_time) # convert to data-time format

# Get time difference in minutes to have another numerical attribute
tidydogs <- tidydogs %>% 
  mutate(ava_time = as_datetime(published_at)) %>%
  mutate(timediff = as.numeric(difftime(ref_time, ava_time))) #get time difference in minutes
tidydogs %>% 
  filter(breeds_primary %in% fav_breeds) %>%  
  ggplot(aes(x = age, fill = gender)) +
  geom_bar() +
  facet_wrap(~breeds_primary)

That’s all for this week!