Business is Barking

Introduction

[1.1] Dogs come in all shapes, sizes, and personalities, and adoptable dogs are no exception. Like humans, these dogs create demand for goods and services. Nearly every canine bone, leash, bowl, or bed comes to market through an extensive supply chain that is forecast to meet that demand. For example, bigger dogs need bigger beds, bones, collars, and cages, while products for smaller dogs take less retail shelf space and cost less to ship. The goal is to minimize shortages and keep costs down. The job is not only about logistics and purchasing inventory; it also requires canine insights to get the right products to the correct region at the right time.

Focus

[1.2] The project focuses on descriptive insights about adoptable dogs, using a dataset collected from Petfinder on September 20, 2019 and distributed through GitHub. This analysis is intended to help canine manufacturers, distributors, and retailers place merchandise in the regions with the highest probability of demand for specific products and services. Specifically, we’ll focus on four areas: size, age, breed, and house training.

Size: Oversized and heavier products cost more to ship than smaller products, so we’ll explore which states have the highest demand for products with similar attributes. This lets distributors consolidate freight more efficiently.
Age: Aging dogs have different nutritional needs. At a given point in time, does the adoptable canine population skew older?
Breed: Dogs vary in energy and temperament by breed and age. Dog walkers, fetch products, and even dog strollers benefit from insights into where dogs with high energy needs are concentrated.
House-Trained: Products such as hidden fences and training whistles help with house training. But are adoptable dogs more likely to be house-trained already?

Analytical Approach

[1.3] We’ll use descriptive statistics to learn about dogs available for adoption on September 20, 2019. Graphical and numerical summaries will cover breed, size, age, and behavioral characteristics. Specifically, we’ll look for patterns and associations between location and these characteristics.

Mission

[1.4] With this data analysis, manufacturers can optimize their supply chain with more accurate forecasts for canine-specific products. Ultimately, the analysis is intended to help reduce freight costs, prevent stockouts, and improve customer service levels.

Packages Required

[2.1] Packages required for this project are as follows:

list.of.packages <- c("tidyverse", "readr", "maps", "DT", "knitr", "rmarkdown", "ggthemes", "plotly", "mapproj")
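
A minimal sketch of how the list above could be checked against the local library before loading. The helper name missing_packages is hypothetical and not part of the original report, and the install step is left commented out so the sketch has no side effects.

```r
# Hypothetical helper (not in the original report): return the packages
# from a character vector that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

list.of.packages <- c("tidyverse", "readr", "maps", "DT", "knitr",
                      "rmarkdown", "ggthemes", "plotly", "mapproj")

to_install <- missing_packages(list.of.packages)

# Uncomment to install whatever is missing.
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means every required package is already available.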

[2.2] Suppression of messages and warnings regarding loading packages.

library(tidyverse)
library(readr)
library(maps)
library(DT)
library(knitr)
library(rmarkdown)
library(ggthemes)
library(plotly)
library(mapproj)
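
The library() calls above do not themselves suppress anything; in an R Markdown report this is usually done with the chunk options message = FALSE and warning = FALSE. As an alternative, here is a small sketch that quiets the calls in code. The helper name load_quietly is hypothetical, and base packages are used so the example runs without the project's dependencies.

```r
# Hypothetical helper: attach each package quietly and report whether
# it could be loaded.
load_quietly <- function(pkgs) {
  vapply(pkgs, function(p) {
    suppressWarnings(suppressPackageStartupMessages(
      require(p, character.only = TRUE, quietly = TRUE)
    ))
  }, logical(1))
}

# Base packages stand in for the project's package list here.
loaded <- load_quietly(c("stats", "utils"))
print(loaded)
```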

[2.3] Summary of Packages

library(tidyverse)     # collection of packages for data import, wrangling, and visualization
library(readr)         # to easily import delimited data
library(maps)          # for geographical data
library(mapproj)       # to convert latitude/longitude into projected coordinates
library(DT)            # to create functional tables in HTML
library(knitr)         # for dynamic report generation
library(rmarkdown)     # to convert R Markdown documents into a variety of formats
library(ggthemes)      # to implement theme across report
library(plotly)        # for dynamic plotting

Data Preparation

[3.1] The project contains data used in The Pudding essay Finding Forever Homes written by Amber Thomas and published in October 2019.

Three datasets, labeled Adoptable Dogs, were downloaded from GitHub: dog_descriptions.csv, dog_moves.csv, and dog_travel.csv.

[3.2] Our project uses the Finding Forever Homes data, originally collected from Petfinder.com and covering all adoptable dogs in the U.S. on a single day, September 20, 2019.

The data was originally assembled for the Finding Forever Homes essay to show where each state’s adoptable dogs are imported from and why they were relocated. The essay draws conclusions about the benefits and risks of transporting dogs for adoption.
The data comes in three CSV files: dog_descriptions.csv, dog_moves.csv, and dog_travel.csv.
- dog_descriptions.csv has 58,180 entries with 36 variables. Each row represents an individual adoptable dog in the U.S. on September 20, 2019, and each dog has a unique ID number. Unless otherwise noted, all the data is exactly as reported by the shelter or rescue that posted the animal for adoption on Petfinder.
- dog_moves.csv has 90 entries with 5 variables. It was produced by a script that processed a file of adoptable dogs with their origin and destination locations to find the total number of imports and exports for each location.
- dog_travel.csv has 6,194 entries with 8 variables. Each row represents a dog available for adoption somewhere in the U.S. on September 20, 2019 that is described as having been moved from another location to its current site.
Missing values are recorded as “NA” in the original datasets.
The dog_travel.csv and dog_descriptions.csv datasets can be joined on their common id column.
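
A minimal sketch of the join on id, using toy stand-ins for the two files. The values below are made up for illustration, and the real tables have many more columns. An inner join is assumed here so that only dogs present in both tables are kept.

```r
library(dplyr)

# Toy stand-ins for dog_descriptions and dog_travel (illustrative values only).
descriptions <- tibble(
  id   = c(101, 102, 103),
  size = c("Small", "Large", "Medium")
)
travel <- tibble(
  id    = c(102, 103, 103),
  found = c("Texas", "Ohio", "Puerto Rico")
)

# Each travel row picks up the matching dog's attributes via the shared id.
travel_with_details <- inner_join(travel, descriptions, by = "id")
print(travel_with_details)
```

A dog can appear more than once in the travel table in this sketch, so the joined result can have more rows than there are distinct dogs.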

[3.3] Data is imported from csv as shown below:

dog_moves <- readr::read_csv(url("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   location = col_character(),
##   exported = col_double(),
##   imported = col_double(),
##   total = col_double(),
##   inUS = col_logical()
## )

dog_travel <- readr::read_csv(url("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_travel.csv"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_double(),
##   contact_city = col_character(),
##   contact_state = col_character(),
##   description = col_character(),
##   found = col_character(),
##   manual = col_character(),
##   remove = col_logical(),
##   still_there = col_logical()
## )

dog_descriptions <- readr::read_csv(url("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_descriptions.csv"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   id = col_double(),
##   breed_mixed = col_logical(),
##   breed_unknown = col_logical(),
##   fixed = col_logical(),
##   house_trained = col_logical(),
##   declawed = col_logical(),
##   special_needs = col_logical(),
##   shots_current = col_logical(),
##   env_children = col_logical(),
##   env_dogs = col_logical(),
##   env_cats = col_logical(),
##   accessed = col_date(format = "")
## )
## ℹ Use `spec()` for the full column specifications.

After reviewing the variables in each dataset, we removed those that would not contribute to the story of our analysis. We kept 14 of the original 36 variables in dog_descriptions.csv and dropped the rest.

dog_descriptions$url <- NULL
dog_descriptions$species <- NULL
dog_descriptions$breed_secondary <- NULL
dog_descriptions$breed_mixed <- NULL
dog_descriptions$breed_unknown <- NULL
dog_descriptions$color_primary <- NULL
dog_descriptions$color_secondary <- NULL
dog_descriptions$color_tertiary <- NULL
dog_descriptions$fixed <- NULL
dog_descriptions$declawed <- NULL
dog_descriptions$special_needs <- NULL
dog_descriptions$shots_current <- NULL
dog_descriptions$env_children <- NULL
dog_descriptions$env_dogs <- NULL
dog_descriptions$env_cats <- NULL
dog_descriptions$tags <- NULL
dog_descriptions$photo <- NULL
dog_descriptions$status <- NULL
dog_descriptions$posted <- NULL
dog_descriptions$stateQ <- NULL
dog_descriptions$accessed <- NULL
dog_descriptions$type <- NULL
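
The same kind of column removal could be sketched in one step with dplyr::select(). The toy table and the shortened drop list below are illustrative, not the report's data. any_of() is used so that a name missing from the table does not raise an error.

```r
library(dplyr)

# Toy table carrying a few of the real column names (illustrative values).
toy <- tibble(
  id      = 1:2,
  url     = c("a", "b"),
  species = "Dog",
  size    = c("Small", "Large")
)

# "photo" is deliberately absent from the toy table; any_of() skips it.
drop_cols <- c("url", "species", "photo")

trimmed <- select(toy, -any_of(drop_cols))
print(names(trimmed))
```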

After evaluating the column specifications, it is important to address missing values. We use the is.na() function to determine which columns contain missing values.

colSums(is.na(dog_descriptions))

##              id          org_id   breed_primary             age             sex 
##               0               0               0               0               0 
##            size            coat   house_trained            name    contact_city 
##               0           30995               0               0               0 
##   contact_state     contact_zip contact_country     description 
##               0              12               0            8705

colSums(is.na(dog_moves))

## location exported imported    total     inUS 
##        0        9       52       39        0

colSums(is.na(dog_travel))

##            id  contact_city contact_state   description         found 
##             0             0             0             0             0 
##        manual        remove   still_there 
##          4047          4456          5875

From here it is good practice to standardize the “NA” values under an explicit label, in this case “Unknown”. This is harder for columns that are not character, because assigning a text label to a numeric or logical column changes its type.

dog_descriptions$coat[is.na(dog_descriptions$coat)] <- "Unknown"
dog_descriptions$description[is.na(dog_descriptions$description)] <- "Unknown"
dog_descriptions$contact_zip[is.na(dog_descriptions$contact_zip)] <- "Unknown"
dog_travel$manual[is.na(dog_travel$manual)] <- "Unknown"
dog_travel$remove[is.na(dog_travel$remove)] <- "Unknown"
dog_travel$still_there[is.na(dog_travel$still_there)] <- "Unknown"
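
One side effect is worth illustrating: remove and still_there were read in as logical columns, and assigning the text label “Unknown” to a logical vector converts the whole vector to character. A small base R sketch with a made-up vector:

```r
# Made-up logical vector standing in for a column such as dog_travel$remove.
remove_flag <- c(TRUE, NA, FALSE)
class_before <- class(remove_flag)

# Replacing NA with a text label coerces every element to character.
remove_flag[is.na(remove_flag)] <- "Unknown"
class_after <- class(remove_flag)

print(class_before)
print(class_after)
print(remove_flag)
```

After this step the values are the strings "TRUE" and "FALSE", so later logical filters on such a column would need to compare against text instead.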

From here we can see that the only numeric values that may later be used for calculations are in dog_moves.csv. The character variables that will be used for plotting are clean.

table(dog_descriptions$size)

## 
## Extra Large       Large      Medium       Small 
##         931       15761       29908       11580

table(dog_descriptions$age)

## 
##  Adult   Baby Senior  Young 
##  27955   9397   4634  16194

These datasets are based directly on Petfinder data for a single day and consist mainly of character variables. After looking at the data, there do not appear to be any outliers.

[3.4] We will need to modify our tables in the final analysis and learn to use the knitr::kable() function.
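
As a rough sketch of what a knitr::kable() table might look like, the size counts reported earlier in this section can be formatted as follows. The column labels and caption are illustrative choices, not part of the original report.

```r
library(knitr)

# Size counts as reported by table(dog_descriptions$size) above.
size_counts <- data.frame(
  size = c("Extra Large", "Large", "Medium", "Small"),
  dogs = c(931, 15761, 29908, 11580)
)

# col.names and caption are presentation choices made for this sketch.
size_table <- kable(size_counts,
                    col.names = c("Size", "Dogs"),
                    caption = "Adoptable dogs by size")
print(size_table)
```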

[3.5] When analyzing the Petfinder data, there are some variables we will need to monitor closely. For example, in dog_moves.csv the exported and imported counts do not always add up to the total column. This will be a point of concern when matching state data to the variables we chose to analyze, so we will need to run accuracy checks in our final analysis.
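
One way such an accuracy check might be sketched is shown below. The values are hypothetical and are not taken from dog_moves.csv. Treating a missing count as zero is an assumption made for this example only.

```r
library(dplyr)

# Hypothetical values for illustration; not taken from dog_moves.csv.
moves <- tibble(
  location = c("A", "B", "C"),
  exported = c(10, NA, 5),
  imported = c(20, 7, NA),
  total    = c(30, 50, NA)
)

checked <- moves %>%
  mutate(
    # Assumption: a missing count is treated as zero moves.
    moved   = coalesce(exported, 0) + coalesce(imported, 0),
    # A row matches only when total is present and equals the sum.
    matches = !is.na(total) & total == moved
  )

print(checked)
```

A mismatch does not by itself show which column is wrong; it may also mean that total measures something other than the sum of exports and imports.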

Proposed Exploratory Data Analysis

[4.1] The project requires the observations to be broken into subsets. We have divided the United States into regions, as noted on the map below. We plan to do cluster analysis on the categorical variables, and summarizing the data will provide insights.

However, our data must be tidy before we can pull out insights. We need to create new variables to compute summaries for each region, including percentages and quartiles for the variables.

We must create a new variable that we’ll call region. This map is an illustration.
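
A minimal sketch of how a region variable could be derived from a state abbreviation. The report's own regions are defined on its map, so this example assumes the four regions in R's built-in state.region dataset as a stand-in. The input vector is made up.

```r
# Made-up state abbreviations standing in for dog_descriptions$contact_state.
contact_state <- c("TX", "NY", "CA", "OH", "DC")

# state.abb and state.region ship with base R and cover the 50 states;
# anything else (here "DC") has no match and becomes NA.
region <- as.character(state.region[match(contact_state, state.abb)])

print(data.frame(contact_state, region))
```

Values that map to NA would need an explicit rule, for example a manual assignment for the District of Columbia.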

Frequency tables will be created in order to plot the data.

[4.2] Expected plots and table illustrations. We will make frequency datasets in order to gather statistical information and plot the data. We will need to make some tables for ranking our data.
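
A frequency dataset of this kind could be sketched with dplyr as follows. The toy rows and the region labels are illustrative only. The share column gives each size's proportion within its region.

```r
library(dplyr)

# Toy data: one row per dog, with an assumed region label.
dogs <- tibble(
  region = c("South", "South", "South", "West", "West"),
  size   = c("Large", "Large", "Small", "Medium", "Small")
)

size_by_region <- dogs %>%
  count(region, size, name = "dogs") %>%   # frequency per region and size
  group_by(region) %>%
  mutate(share = dogs / sum(dogs)) %>%     # proportion within each region
  ungroup() %>%
  arrange(region, desc(share))             # rank sizes within each region

print(size_by_region)
```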

[4.3] What do we need to learn about to achieve our mission?

We want to summarize the data and find:

1.) The most common (modal) dog age, size, and sex for each region; these variables are categorical, so a mean is not defined.

2.) Is there a correlation between sex and size? And location?

3.) Summarize frequencies of dog attributes for each region or state.

4.) Are there any insights into where categorical attributes concentrate in the United States?

5.) Do certain primary breeds attract owners in certain climates?

6.) Where are most of the house-trained dogs?

7.) Is there a correlation of any variable with a dog being house-trained?
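
Because these variables are categorical, a question such as 7 is usually approached through a contingency table instead of a correlation coefficient. A minimal sketch follows, using hypothetical counts and a chi-squared test of independence as one possible choice of test.

```r
# Hypothetical counts of dogs by size and house-trained status.
counts <- matrix(
  c(120, 380,
    200, 300,
     90, 110),
  nrow = 3, byrow = TRUE,
  dimnames = list(size = c("Small", "Medium", "Large"),
                  house_trained = c("TRUE", "FALSE"))
)

# Share of house-trained dogs within each size group (rows sum to 1).
row_shares <- prop.table(counts, margin = 1)

# Chi-squared test of independence between size and house-trained status.
test <- chisq.test(counts)

print(round(row_shares, 3))
print(test$p.value)
```

A small p-value would suggest the two variables are associated, but it does not describe the strength or direction of that association.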

[4.4] Any new techniques needed?

We will need to make some tables for ranking our data. Knowing the proportions behind the descriptive statistics will be helpful.
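
As a small sketch of a ranked proportion table, the size counts reported in the data preparation section can be converted to shares with base R's prop.table() and sorted.

```r
# Size counts as reported by table(dog_descriptions$size).
size_counts <- c("Extra Large" = 931, "Large" = 15761,
                 "Medium" = 29908, "Small" = 11580)

# Convert counts to proportions, then rank from most to least common.
size_share <- prop.table(size_counts)
ranked <- sort(size_share, decreasing = TRUE)

print(round(100 * ranked, 1))
```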