Data Preparation

library(tidyverse)
library(magrittr)
library(ggplot2)
library(DT)
library(RSocrata)

# load data
token <- "ew2rEMuESuzWPqMkyPfOSGJgE"
dogs <- read.socrata("https://data.cityofnewyork.us/resource/nu7n-tubp.csv", app_token = token)

# find the age in days of each dog based on the date that the dataset was last updated (September 09 2018)
dogs$animalbirth <- as.Date(strptime(dogs$animalbirth, format = "%Y-%m-%d")) 
data_updated <- as.Date(strptime('2018-09-10', format = "%Y-%m-%d")) 
dogs$age_days <- as.vector(data_updated - dogs$animalbirth)

# change text columns to lowercase
dogs$animalgender<- sapply(dogs$animalgender, tolower) 
dogs$animalname <- sapply(dogs$animalname, tolower)
dogs$borough <- sapply(dogs$borough, tolower)
dogs$breedname <- sapply(dogs$breedname, tolower)

#column value cleanup
dogs$borough <- gsub('staten is(?!land)','staten island',dogs$borough, perl = TRUE)
dogs$borough <- gsub('new york','manhattan',dogs$borough, perl = TRUE)
dogs$breedname <- gsub('(american pit bull mix / pit bull mix)|(american pit bull terrier/pit bull)','pitbull',dogs$breedname)
dogs$breedname <- gsub(' crossbreed|(,.+)','',dogs$breedname)

# subset the data to only keep columns of interest and values with a minimum amount of occurances
dogs %<>%
  select(animalgender, animalname, animalbirth, borough, age_days, breedname, zipcode) %>%
  group_by(breedname) %>%   #filter to exclude uncommon dogs
  filter(n()>100) %>% 
  group_by(borough) %>%   #filter to exclude outer boroughs
  filter(n()>20) %>%  
  group_by(zipcode) %>%
  filter(n()>20) %>%
  filter(breedname != 'unknown')

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Question 1: In New York City, are certain dog breeds more common than others in certain boroughs or zip codes?

Question 2: In New York City, does average age differ between different dog breeds?

Cases

What are the cases, and how many are there?

Each case is a unique New York City dog license that was active in 2016. There are 101,611 active dog licenses in New York City in 2016 for known dog breeds with at least 100 registered dogs.

nrow(dogs)
## [1] 101611

Data collection

Describe the method of data collection.

The data is collected through the Department of Health and Mental Hygiene Dog Licensing System.

Type of study

What type of study is this (observational/experiment)?

This data set is observational.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The source for this data is the NYC Open Data Dog Licensing Dataset.

Response Variable

What is the response variable, and what type is it (numerical/categorical)?

Question 1: The response variable is dog breed, and it is a catergorial variable.

Question 2: The response variable is dog age, and it is a numerical variable.

Explanatory Variable

Explanatory: What is the explanatory variable(s), and what type is it (numerical/categorical)?

Question 1: The explanatory variable is location (Borough and Zip Code, both categorical).

Question 2: The explanatory variable is dog breed (categorical).

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

First, we can look at a preview of what our data looks like:

datatable(head(dogs))

Amount of Dog Licences per Borough

The following summary table shows the quantity of dogs in each borough.

table(dogs$borough)
## 
##         bronx      brooklyn     manhattan        queens staten island 
##         10828         25316         35484         20846          9137

Dog Breed by Borough

This next chart summarizes the most common dog breeds and how many of them live in each borough. Havanese dogs are an interesting subset, with 1188 Havanese dogs registered in Manhattan but only 90 in the Bronx.

common_dogs <- dogs %>%
  group_by(breedname) %>% 
  filter(n()>2000) 

table(common_dogs$breedname, common_dogs$borough)
##                       
##                        bronx brooklyn manhattan queens staten island
##   beagle                 261      800       952    720           308
##   chihuahua             1001     1818      2311   1263           403
##   havanese                90      294      1188    305           139
##   jack russell terrier   188      542       750    371           177
##   labrador retriever     443     1770      2602   1284           855
##   maltese                626     1123      1379   1330           438
##