In New York City, are certain dog breeds more common than others by zip code?
Knowing which dogs breeds are most prevalent in each subsection of the city could help with targeted adoption strategies. For example, animal shelters can host breed-specific adoption events in zip codes that are more inclined to adopting a certain dog breed. Knowing which zip codes favor dog breeds that are unlikely to be found in shelters will also help shelters either promote adoption in those areas, or focus their efforts elsewhere.
The following libraries have been used in this project for data cleaning, visualization, and statistical tests.
library(tidyverse)
library(magrittr)
library(ggplot2)
library(DT)
library(RSocrata)
library(ggplot2)
library(choroplethr)
library(choroplethrZip)
library(data.table)
library(gplots)
library(rvest)
library(stringr)
library(statsr)
Data collection: The source for this data is the NYC Open Data Dog Licensing Dataset. The data is using the amount of active dog licenses in 2016, collected through the Department of Health and Mental Hygiene Dog Licensing System.
Cases: The cases are unique dog licenses that were active during the year 2016.
Variables: I will be studying dog breed and zip code, both of which are categorical variables.
Type of study: This is an observational study, including all dogs licensed in NYC during 2016.
Scope of inference - generalizability: The population of interest is dogs with active licenses in 2016, and this analysis can be generalized to the population of all dogs in NYC (licensed and unlicensed). The majority of dogs in NYC (80%) are unlicensed, and we will assume that a random sample of licensed NYC dogs are an accurate representation of all NYC dogs. Although the analysis should be able to be generalized to the population, there are a few concerns:
Scope of inference - causality: Since this is an observational study, it would be challenging to establish causality between zip code and dog breed.
To perform exploratory data analysis, we first obtain and clean the dog license dataset.
The following steps were taken to tidy the data:
# load data
token <- "ew2rEMuESuzWPqMkyPfOSGJgE"
dogs <- read.socrata("https://data.cityofnewyork.us/resource/nu7n-tubp.csv", app_token = token)
dogs$animalbirth <- as.Date(strptime(dogs$animalbirth, format = "%Y-%m-%d"))
data_updated <- as.Date(strptime('2018-09-10', format = "%Y-%m-%d"))
dogs$age_days <- as.vector(data_updated - dogs$animalbirth)
# change text columns to lowercase
dogs$animalgender<- sapply(dogs$animalgender, tolower)
dogs$animalname <- sapply(dogs$animalname, tolower)
dogs$borough <- sapply(dogs$borough, tolower)
dogs$breedname <- sapply(dogs$breedname, tolower)
# column value cleanup
dogs$borough <- gsub('staten is(?!land)','staten island',dogs$borough, perl = TRUE)
dogs$borough <- gsub('new york','manhattan',dogs$borough, perl = TRUE)
dogs$breedname <- gsub('(american pit bull mix / pit bull mix)|(american pit bull terrier/pit bull)','pitbull',dogs$breedname)
dogs$breedname <- gsub(' crossbreed|(,.+)','',dogs$breedname)
dogs$breedname <- gsub('german shepherd dog','german shepherd',dogs$breedname)
dogs$breedname <- gsub('german shepherd dog','german shepherd',dogs$breedname)
# subset the data
dogs %<>%
select(animalgender, animalname, animalbirth, borough, age_days, breedname, zipcode) %>%
group_by(breedname) %>% #filter to exclude uncommon dogs
filter(n()>2000) %>%
group_by(borough) %>% #filter to exclude outer boroughs
filter(n()>20) %>%
group_by(zipcode) %>%
filter(n()>20) %>%
filter(breedname != 'unknown')
datatable(dogs)
The following summary table shows the quantity of dogs in each borough.
balloonplot(t(table(dogs$breedname,dogs$borough)), main ="Dog Breeds by Borough", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
For a more detailed glimpse at dog breeds by location, the table below shows the most common dog breed within each zipcode.
mostrepeated <- function(x) as(names(which.max(table(x))), mode(x))
common_zip <- setDT(as.data.frame(tapply(dogs$breedname, dogs$zipcode, FUN = mostrepeated)), keep.rownames = "region")[] %>% rename("value" = "tapply(dogs$breedname, dogs$zipcode, FUN = mostrepeated)")
nyc_fips = c(36005, 36047, 36061, 36081, 36085)
zip_choropleth(common_zip,
county_zoom = nyc_fips,
num_colors = 8,
legend="Breed Name",
title="Most Popular Dog Breed by Zip Code") +
scale_fill_brewer(palette="Set3")
If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.
In order to evaluate if there is an association between borough of residence and preferred dog breed, we will perform a chi-square test for two-way tables.
\(H_0\): There is no association between borough of residence and preferred dog breed
\(H_1\): There is an association between borough of residence and preferred dog breed.
This condition is met, since dog breed is a categorical variable.
This condition is met, all boroughs have more than 5 licenses for each dog breed.
table(dogs$breedname,dogs$borough)
##
## bronx brooklyn manhattan queens staten island
## beagle 261 800 952 720 308
## chihuahua 1001 1818 2311 1262 403
## german shepherd 336 966 734 903 442
## havanese 90 294 1188 304 139
## jack russell terrier 188 542 750 371 177
## labrador retriever 443 1770 2602 1281 855
## maltese 626 1123 1379 1325 438
## pitbull 1247 2024 1388 1366 688
## pomeranian 270 594 727 544 180
## poodle 543 1064 2241 1148 289
## shih tzu 1293 2165 2174 1809 787
## yorkshire terrier 1369 2166 2268 1860 778
This condition is NOT met - the sample was obtained from dogs that had active licenses in 2016. Instead, we will perform simulation based inference.
We perform a simulation based hypothesis test below to test if borough and breedname are independent, using the proportion statistic. The outcome of the simulationreturns a p-value of 0, so we reject the null hypothesis in favor of the alternative hypothesis.
\(H_1\): There is an association between borough of residence and preferred dog breed.
inference(breedname, borough, data = dogs, type = "ht", statistic = "proportion", method = "theoretical")
## Warning: Use alternative = "greater" for chi-square test
## Response variable: categorical (12 levels)
## Explanatory variable: categorical (5 levels)
## Observed:
## y
## x beagle chihuahua german shepherd havanese
## bronx 261 1001 336 90
## brooklyn 800 1818 966 294
## manhattan 952 2311 734 1188
## queens 720 1262 903 304
## staten island 308 403 442 139
## y
## x jack russell terrier labrador retriever maltese pitbull
## bronx 188 443 626 1247
## brooklyn 542 1770 1123 2024
## manhattan 750 2602 1379 1388
## queens 371 1281 1325 1366
## staten island 177 855 438 688
## y
## x pomeranian poodle shih tzu yorkshire terrier
## bronx 270 543 1293 1369
## brooklyn 594 1064 2165 2166
## manhattan 727 2241 2174 2268
## queens 544 1148 1809 1860
## staten island 180 289 787 778
##
## Expected:
## y
## x beagle chihuahua german shepherd havanese
## bronx 388.0459 867.0738 431.4314 257.1234
## brooklyn 775.6868 1733.2430 862.4127 513.9786
## manhattan 947.1619 2116.3975 1053.0596 627.5999
## queens 652.5467 1458.0909 725.5048 432.3846
## staten island 277.5588 620.1947 308.5914 183.9135
## y
## x jack russell terrier labrador retriever maltese
## bronx 258.7823 886.9802 624.1145
## brooklyn 517.2946 1773.0349 1247.5778
## manhattan 631.6489 2164.9859 1523.3702
## queens 435.1742 1491.5659 1049.5250
## staten island 185.1001 634.4332 446.4124
## y
## x pitbull pomeranian poodle shih tzu yorkshire terrier
## bronx 856.6103 295.4048 674.3908 1049.9314 1077.1112
## brooklyn 1712.3267 590.5015 1348.0779 2098.7672 2153.0984
## manhattan 2090.8575 721.0390 1646.0870 2562.7254 2629.0672
## queens 1440.4951 496.7595 1134.0707 1765.5882 1811.2944
## staten island 612.7104 211.2952 482.3737 750.9878 770.4288
##
## H0: borough and breedname are independent
## HA: borough and breedname are dependent
## chi_sq = 2873.2497, df = 44, p_value = 0
Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.
I was able to find that the breeds of dogs vary greatly by borough, confirmed by a simulation-based hypothesis test. In the future, it would be interesting to perform a chi-square test on a dataset that is randomly sampled and therefore meets conditions for inference.
This data has also allowed me to create visualizations about other metrics, such as most common dog names and average age of dogs.
require("ggrepel")
dogs %>%
group_by(animalgender, animalname) %>%
filter(animalname != "name not provided") %>%
filter(animalname != "unknown") %>%
filter(animalgender %in% c("f","m")) %>%
summarise(count_name = n()) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(animalgender, count_name)) + geom_label_repel(aes(label = animalname, color = animalgender, size = count_name)) + theme(legend.position="none") + facet_wrap(~animalgender, scales = "free") + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
require("ggrepel")
dogs %>%
group_by(breedname, animalname) %>%
filter(animalname != "name not provided") %>%
filter(animalname != "unknown") %>%
summarise(count_name = n()) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(breedname, count_name)) + geom_label_repel(aes(label = animalname, color = breedname)) + theme(legend.position="none") + facet_wrap(~breedname, scales = "free") + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Looking at average age of each dog breed gives some interesting insights. There is a large range between jack russel terrier with an average age of 3326 days (or 9 years) and pitbulls with an average age of 2391 days (6.5 years).
dogs %>%
group_by(breedname) %>%
summarise(average_age = mean(age_days))
## # A tibble: 12 x 2
## breedname average_age
## <chr> <dbl>
## 1 beagle 2890.
## 2 chihuahua 2912.
## 3 german shepherd 2912.
## 4 havanese 2644.
## 5 jack russell terrier 3326.
## 6 labrador retriever 2880.
## 7 maltese 2846.
## 8 pitbull 2391.
## 9 pomeranian 2753.
## 10 poodle 2968.
## 11 shih tzu 2846.
## 12 yorkshire terrier 2757.
All in all, this dataset has many variables to consider but fails to meet all the necessary conditions for most statistical inferences.