In this report, we explore Petfinder data. Petfinder is an online database of animals who need homes. Using the PetFinder API and R, we downloaded data available for the city of Los Angeles, CA by April 2023. Using data wrangling techniques and visualization, we show that:
(Include the key three stories about your data here)
A glimpse of the dataset is shown below:
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate) # to deal with date and time data
library(kableExtra) # Nicely-formatted tables (you don't need this; it's just trying this option)
dogs <- read_csv(here("data","dogs_LA_data.csv.csv")) # dogs looking for forever homes in LA (via Petfinder.com)
head(dogs)
## # A tibble: 6 x 49
## id organization_id url type species age gender size coat tags
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 62679064 CA448 https:/~ Dog Dog Young Female Medi~ <NA> "cha~
## 2 62679056 CA1763 https:/~ Cat Cat Young Female Medi~ <NA> "cha~
## 3 62679054 CA190 https:/~ Cat Cat Baby Female Medi~ <NA> "cha~
## 4 62679044 CA2713 https:/~ Cat Cat Baby Female Medi~ Short "c(\~
## 5 62679023 CA2713 https:/~ Dog Dog Baby Female Small Medi~ "c(\~
## 6 62679007 CA2713 https:/~ Dog Dog Adult Male Small Short "c(\~
## # i 39 more variables: name <chr>, description <chr>,
## # organization_animal_id <chr>, photos <chr>, videos <chr>, status <chr>,
## # status_changed_at <dttm>, published_at <dttm>, distance <dbl>,
## # breeds.primary <chr>, breeds.secondary <chr>, breeds.mixed <lgl>,
## # breeds.unknown <lgl>, colors.primary <chr>, colors.secondary <chr>,
## # colors.tertiary <chr>, attributes.spayed_neutered <lgl>,
## # attributes.house_trained <lgl>, attributes.declawed <lgl>, ...
Before conducting any analysis, data processing was performed. In this stage, the dataset was cleaned and prepared for easy manipulation.
tidydogs <- dogs %>%
filter(type == "Dog") %>%
clean_names() %>% # Careful with the autopredict in the next lines, because column names might change
rename(house_trained = attributes_house_trained,
spayed_neutered = attributes_spayed_neutered,
declawed = attributes_declawed,
spatial_needs = attributes_special_needs) %>%
select(id, age, gender, type, species, name, breeds_primary, distance, published_at, status, description, contact_address_address1, contact_address_city)
names(tidydogs) # list the new column names
## [1] "id" "age"
## [3] "gender" "type"
## [5] "species" "name"
## [7] "breeds_primary" "distance"
## [9] "published_at" "status"
## [11] "description" "contact_address_address1"
## [13] "contact_address_city"
Since the column names were already in good shape, it just changed a few (.) to (_). The original dataset has 49 columns with information about the dogs, and only 14 columns were selected for this study.
As part of the data exploration, I plotted the total number of dogs looking for forever home by age and gender, but only including my favorite breeds. I started by creating that list.
#Create a list of favorite/known breeds
fav_breeds <- c("Labrador Retriever",
"Husky",
"Shih Tzu",
"Belgian Shepherd / Malinois",
"Siberian Husky",
"Dutch Shepherd",
"Golden Retriever",
"Belgian Shepherd / Laekenois",
"German Shepherd Dog", "Shepherd", "Dachshund",
"Mixed Breed",
"Australian Shepherd",
"Akita")
For plotting, I use the ggplot package contained in
tidyverse and the geom_count function that is
used for discrete variables, and it creates a bubbleplot. See
more info here. geom_count creates a new dataframe on
the basis of the dataframe ‘tidydogs.’ As stated in the manual: “A way
of using the variables that geom_count has created, is by referring to
them with the “..” before, and after. By default, the
counts will be used (..count..), but we can also refer to
the proportions in the created dataframe by using:
size=..prop... (we specify group=class to
signal to geom_count that we want the proportions within
class) Below you can see the dataframe that was created through
geom_count”
#create plot using proportions by breed
tidydogs %>%
filter(breeds_primary %in% fav_breeds) %>% # using only the breeds_primary attribute for this selection
ggplot(aes(x = age, y = breeds_primary, color = gender)) +
geom_count(aes(size=..prop.., group = breeds_primary)) +
scale_size_area() + # ensures that counts of zero would be given size 0--learned this reading the documentation for geom_count :)
facet_wrap(~ gender)
I compared these results with the ones obtained while using the
default counts instead of prop
(proportions)
#create plot using total counts
tidydogs %>%
filter(breeds_primary %in% fav_breeds) %>%
ggplot(aes(x = age, y = breeds_primary, color = gender)) +
geom_count(aes(group = breeds_primary)) +
scale_size_area() + # ensures that counts of zero would be given size 0--learned this reading the documentation for geom_count :)
facet_wrap(~ gender)
Figure 2. Total number of dogs by breed and age. The plot shows only 13 of my favorite dogs breeds. From all breeds, female mixed-breed puppies makes the highest population with 30 dogs looking for forever home, followed by the group of young male huskies
Then, I compared the results with the actual data, by doing some data wrangling as used in lesson 2.
tidydogs %>%
filter(breeds_primary %in% fav_breeds) %>%
group_by(breeds_primary, age, gender) %>%
count() %>%
rename(count = n) %>%
arrange(-count) %>%
head() %>%
#You don't NEED the following two lines. They are just to give the table a better format in the html. See more ideas here: https://bookdown.org/yihui/rmarkdown-cookbook/kableextra.html
kable(format = "html") %>%
kable_styling(font_size = 11)
| breeds_primary | age | gender | count |
|---|---|---|---|
| Mixed Breed | Baby | Female | 30 |
| Husky | Young | Male | 29 |
| Labrador Retriever | Young | Male | 20 |
| German Shepherd Dog | Young | Male | 16 |
| German Shepherd Dog | Adult | Female | 13 |
| German Shepherd Dog | Baby | Male | 13 |
Based on these results, both plots are helpful for decision-making. The second plot that uses count and not proportions is helpful to select across breeds, and the first plot would be useful for making the decision while comparing results just within a specific breed but still having a whole picture of the options.
tidydogs %>%
na.omit() %>%
mutate(logdistance = log(distance)) %>%
ggplot(aes(x = breeds_primary, y = logdistance, fill = age)) +
geom_boxplot() +
coord_flip()
datadownload_time <- "2023-04-30 22:15:00"
ref_time <- as_datetime(datadownload_time) # convert to data-time format
# Get time difference in minutes to have another numerical attribute
tidydogs <- tidydogs %>%
mutate(ava_time = as_datetime(published_at)) %>%
mutate(timediff = as.numeric(difftime(ref_time, ava_time))) #get time difference in minutes
tidydogs %>%
filter(breeds_primary %in% fav_breeds) %>%
ggplot(aes(x = age, fill = gender)) +
geom_bar() +
facet_wrap(~breeds_primary)
That’s all for this week!