Introduction

In New York City, are certain dog breeds more common than others by zip code?

Knowing which dogs breeds are most prevalent in each subsection of the city could help with targeted adoption strategies. For example, animal shelters can host breed-specific adoption events in zip codes that are more inclined to adopting a certain dog breed. Knowing which zip codes favor dog breeds that are unlikely to be found in shelters will also help shelters either promote adoption in those areas, or focus their efforts elsewhere.

The following libraries have been used in this project for data cleaning, visualization, and statistical tests.

library(tidyverse)
library(magrittr)
library(ggplot2)
library(DT)
library(RSocrata)
library(ggplot2)
library(choroplethr)
library(choroplethrZip)
library(data.table)
library(gplots)
library(rvest)
library(stringr)
library(statsr)

Data

Data collection: The source for this data is the NYC Open Data Dog Licensing Dataset. The data is using the amount of active dog licenses in 2016, collected through the Department of Health and Mental Hygiene Dog Licensing System.

Cases: The cases are unique dog licenses that were active during the year 2016.

Variables: I will be studying dog breed and zip code, both of which are categorical variables.

Type of study: This is an observational study, including all dogs licensed in NYC during 2016.

Scope of inference - generalizability: The population of interest is dogs with active licenses in 2016, and this analysis can be generalized to the population of all dogs in NYC (licensed and unlicensed). The majority of dogs in NYC (80%) are unlicensed, and we will assume that a random sample of licensed NYC dogs are an accurate representation of all NYC dogs. Although the analysis should be able to be generalized to the population, there are a few concerns:

  • Dog licenses must be renewed annually to remain active, so the sample may underrepresent the amount of older dogs in NYC
  • Shelters register a dog license to everyone who adopts a dog - this may not be the case for other methods of obtaining a dog

Scope of inference - causality: Since this is an observational study, it would be challenging to establish causality between zip code and dog breed.

Exploratory Data Analysis

To perform exploratory data analysis, we first obtain and clean the dog license dataset.

Data Preparation

The following steps were taken to tidy the data:

  • Use a token to extract dog licensing data from socrata
  • Add a column showing the age of each dog based on the date that the dataset was last updated (September 09 2018)
  • Change all text columns to lowercase to help with text analysis
  • Standardize dog breeds (for example, merge “german shepherd dog” and “german shepherd”)
  • Standardize boroughs (for example, “staten is” and “staten island” need to be combined)
  • Subset the data to exclude uncommon dog breeds and zip codes with limited information
# load data
token <- "ew2rEMuESuzWPqMkyPfOSGJgE"
dogs <- read.socrata("https://data.cityofnewyork.us/resource/nu7n-tubp.csv", app_token = token)


dogs$animalbirth <- as.Date(strptime(dogs$animalbirth, format = "%Y-%m-%d")) 
data_updated <- as.Date(strptime('2018-09-10', format = "%Y-%m-%d")) 
dogs$age_days <- as.vector(data_updated - dogs$animalbirth)


# change text columns to lowercase
dogs$animalgender<- sapply(dogs$animalgender, tolower) 
dogs$animalname <- sapply(dogs$animalname, tolower)
dogs$borough <- sapply(dogs$borough, tolower)
dogs$breedname <- sapply(dogs$breedname, tolower)

# column value cleanup
dogs$borough <- gsub('staten is(?!land)','staten island',dogs$borough, perl = TRUE)
dogs$borough <- gsub('new york','manhattan',dogs$borough, perl = TRUE)
dogs$breedname <- gsub('(american pit bull mix / pit bull mix)|(american pit bull terrier/pit bull)','pitbull',dogs$breedname)
dogs$breedname <- gsub(' crossbreed|(,.+)','',dogs$breedname)
dogs$breedname <- gsub('german shepherd dog','german shepherd',dogs$breedname)
dogs$breedname <- gsub('german shepherd dog','german shepherd',dogs$breedname)

# subset the data
dogs %<>%
  select(animalgender, animalname, animalbirth, borough, age_days, breedname, zipcode) %>%
  group_by(breedname) %>%   #filter to exclude uncommon dogs
  filter(n()>2000) %>% 
  group_by(borough) %>%   #filter to exclude outer boroughs
  filter(n()>20) %>%  
  group_by(zipcode) %>%
  filter(n()>20) %>%
  filter(breedname != 'unknown')

datatable(dogs)

Amount of Dog Licences per Borough

The following summary table shows the quantity of dogs in each borough.

balloonplot(t(table(dogs$breedname,dogs$borough)), main ="Dog Breeds by Borough", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Most Common Dog Breed by Zipcode

For a more detailed glimpse at dog breeds by location, the table below shows the most common dog breed within each zipcode.

mostrepeated <- function(x) as(names(which.max(table(x))), mode(x)) 
common_zip <- setDT(as.data.frame(tapply(dogs$breedname, dogs$zipcode, FUN = mostrepeated)), keep.rownames = "region")[] %>% rename("value" = "tapply(dogs$breedname, dogs$zipcode, FUN = mostrepeated)")
nyc_fips = c(36005, 36047, 36061, 36081, 36085)

zip_choropleth(common_zip,
               county_zoom = nyc_fips,
               num_colors = 8,
               legend="Breed Name",
               title="Most Popular Dog Breed by Zip Code") +
  scale_fill_brewer(palette="Set3") 

Inference

If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.

Chi-square Test

In order to evaluate if there is an association between borough of residence and preferred dog breed, we will perform a chi-square test for two-way tables.

Hypothesis Test

\(H_0\): There is no association between borough of residence and preferred dog breed

\(H_1\): There is an association between borough of residence and preferred dog breed.

Check Conditions

  • The variable under study is categorical.

This condition is met, since dog breed is a categorical variable.

  • The expected value of the number of sample observations in each level of the variable is at least 5.

This condition is met, all boroughs have more than 5 licenses for each dog breed.

table(dogs$breedname,dogs$borough)
##                       
##                        bronx brooklyn manhattan queens staten island
##   beagle                 261      800       952    720           308
##   chihuahua             1001     1818      2311   1262           403
##   german shepherd        336      966       734    903           442
##   havanese                90      294      1188    304           139
##   jack russell terrier   188      542       750    371           177
##   labrador retriever     443     1770      2602   1281           855
##   maltese                626     1123      1379   1325           438
##   pitbull               1247     2024      1388   1366           688
##   pomeranian             270      594       727    544           180
##   poodle                 543     1064      2241   1148           289
##   shih tzu              1293     2165      2174   1809           787
##   yorkshire terrier     1369     2166      2268   1860           778
  • The sampling method is simple random sampling.

This condition is NOT met - the sample was obtained from dogs that had active licenses in 2016. Instead, we will perform simulation based inference.

Simulation Based Inference

We perform a simulation based hypothesis test below to test if borough and breedname are independent, using the proportion statistic. The outcome of the simulationreturns a p-value of 0, so we reject the null hypothesis in favor of the alternative hypothesis.

\(H_1\): There is an association between borough of residence and preferred dog breed.

inference(breedname, borough, data = dogs, type = "ht", statistic = "proportion", method = "theoretical")
## Warning: Use alternative = "greater" for chi-square test
## Response variable: categorical (12 levels) 
## Explanatory variable: categorical (5 levels) 
## Observed:
##                y
## x               beagle chihuahua german shepherd havanese
##   bronx            261      1001             336       90
##   brooklyn         800      1818             966      294
##   manhattan        952      2311             734     1188
##   queens           720      1262             903      304
##   staten island    308       403             442      139
##                y
## x               jack russell terrier labrador retriever maltese pitbull
##   bronx                          188                443     626    1247
##   brooklyn                       542               1770    1123    2024
##   manhattan                      750               2602    1379    1388
##   queens                         371               1281    1325    1366
##   staten island                  177                855     438     688
##                y
## x               pomeranian poodle shih tzu yorkshire terrier
##   bronx                270    543     1293              1369
##   brooklyn             594   1064     2165              2166
##   manhattan            727   2241     2174              2268
##   queens               544   1148     1809              1860
##   staten island        180    289      787               778
## 
## Expected:
##                y
## x                 beagle chihuahua german shepherd havanese
##   bronx         388.0459  867.0738        431.4314 257.1234
##   brooklyn      775.6868 1733.2430        862.4127 513.9786
##   manhattan     947.1619 2116.3975       1053.0596 627.5999
##   queens        652.5467 1458.0909        725.5048 432.3846
##   staten island 277.5588  620.1947        308.5914 183.9135
##                y
## x               jack russell terrier labrador retriever   maltese
##   bronx                     258.7823           886.9802  624.1145
##   brooklyn                  517.2946          1773.0349 1247.5778
##   manhattan                 631.6489          2164.9859 1523.3702
##   queens                    435.1742          1491.5659 1049.5250
##   staten island             185.1001           634.4332  446.4124
##                y
## x                 pitbull pomeranian    poodle  shih tzu yorkshire terrier
##   bronx          856.6103   295.4048  674.3908 1049.9314         1077.1112
##   brooklyn      1712.3267   590.5015 1348.0779 2098.7672         2153.0984
##   manhattan     2090.8575   721.0390 1646.0870 2562.7254         2629.0672
##   queens        1440.4951   496.7595 1134.0707 1765.5882         1811.2944
##   staten island  612.7104   211.2952  482.3737  750.9878          770.4288
## 
## H0: borough and breedname are independent
## HA: borough and breedname are dependent
## chi_sq = 2873.2497, df = 44, p_value = 0

Conclusion

Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.

I was able to find that the breeds of dogs vary greatly by borough, confirmed by a simulation-based hypothesis test. In the future, it would be interesting to perform a chi-square test on a dataset that is randomly sampled and therefore meets conditions for inference.

This data has also allowed me to create visualizations about other metrics, such as most common dog names and average age of dogs.

Most Common Dog Names by Gender

require("ggrepel")
dogs %>%
  group_by(animalgender, animalname) %>%
  filter(animalname != "name not provided") %>%
  filter(animalname != "unknown") %>%
  filter(animalgender %in% c("f","m")) %>%
  summarise(count_name = n()) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(animalgender, count_name)) + geom_label_repel(aes(label = animalname, color = animalgender, size = count_name)) + theme(legend.position="none") + facet_wrap(~animalgender, scales = "free") + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Most Common Dog Names by Gender

require("ggrepel")
dogs %>%
  group_by(breedname, animalname) %>%
  filter(animalname != "name not provided") %>%
  filter(animalname != "unknown") %>%
  summarise(count_name = n()) %>%
  top_n(5) %>%
  ungroup() %>%
  ggplot(aes(breedname, count_name)) + geom_label_repel(aes(label = animalname, color = breedname)) + theme(legend.position="none") + facet_wrap(~breedname, scales = "free") + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Average Dog Age by Breed

Looking at average age of each dog breed gives some interesting insights. There is a large range between jack russel terrier with an average age of 3326 days (or 9 years) and pitbulls with an average age of 2391 days (6.5 years).

dogs %>%
  group_by(breedname) %>%
  summarise(average_age = mean(age_days))
## # A tibble: 12 x 2
##    breedname            average_age
##    <chr>                      <dbl>
##  1 beagle                     2890.
##  2 chihuahua                  2912.
##  3 german shepherd            2912.
##  4 havanese                   2644.
##  5 jack russell terrier       3326.
##  6 labrador retriever         2880.
##  7 maltese                    2846.
##  8 pitbull                    2391.
##  9 pomeranian                 2753.
## 10 poodle                     2968.
## 11 shih tzu                   2846.
## 12 yorkshire terrier          2757.

All in all, this dataset has many variables to consider but fails to meet all the necessary conditions for most statistical inferences.