options(repos = c(CRAN = "https://cran.rstudio.com"))
To create this Tidyverse vignette, I use a Kaggle dataset that records all dog bites registered in New York City from 2015 to 2021.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
NYC_Dogs_Bites <- read_csv("https://raw.githubusercontent.com/Pascaltafo2025/DATA-607/main/Dog_Bites_Data.csv")
## Rows: 22663 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): DateOfBite, Species, Breed, Age, Gender, Borough, ZipCode
## dbl (1): UniqueID
## lgl (1): SpayNeuter
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NYC_Dogs_Bites)
## # A tibble: 6 × 9
## UniqueID DateOfBite Species Breed Age Gender SpayNeuter Borough ZipCode
## <dbl> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr>
## 1 1 January 01 2018 DOG UNKN… <NA> U FALSE Brookl… 11220
## 2 2 January 04 2018 DOG UNKN… <NA> U FALSE Brookl… <NA>
## 3 3 January 06 2018 DOG Pit … <NA> U FALSE Brookl… 11224
## 4 4 January 08 2018 DOG Mixe… 4 M FALSE Brookl… 11231
## 5 5 January 09 2018 DOG Pit … <NA> U FALSE Brookl… 11224
## 6 6 January 03 2018 DOG BASE… 4Y M FALSE Brookl… 11231
colnames(NYC_Dogs_Bites)
## [1] "UniqueID" "DateOfBite" "Species" "Breed" "Age"
## [6] "Gender" "SpayNeuter" "Borough" "ZipCode"
sort(unique(year(mdy(NYC_Dogs_Bites$DateOfBite))))
## [1] 2015 2016 2017 2018 2019 2020 2021
The dataset is large because it covers dog bites in NYC from 2015 to 2021, so I decided to restrict my analysis to the year 2021.
In this section, I drop the column(s) that I believe are not relevant to the analysis and keep only the records for my year of interest.
library(lubridate)
# First cleanup: keep only the year of interest and get rid of the unnecessary columns.
NYC_Dog_Bites2021 <- NYC_Dogs_Bites %>%
select(-UniqueID,-Species,-ZipCode) %>% # remove the UniqueID, Species, and ZipCode columns
mutate(Year = year(mdy(DateOfBite))) %>% # parse the "Month DD YYYY" text and keep only the year
filter(Year == 2021) # keep only year 2021
head(NYC_Dog_Bites2021,10)
## # A tibble: 10 × 7
## DateOfBite Breed Age Gender SpayNeuter Borough Year
## <chr> <chr> <chr> <chr> <lgl> <chr> <dbl>
## 1 January 03 2021 FRENCH BULL DOG & PIT … 4 F FALSE Brookl… 2021
## 2 January 02 2021 UNKNOWN <NA> U FALSE Brookl… 2021
## 3 January 05 2021 Rhodesian Ridgeback 10 M FALSE Brookl… 2021
## 4 January 01 2021 MIXED BREED 5 F FALSE Brookl… 2021
## 5 January 03 2021 MIXED 5 M TRUE Brookl… 2021
## 6 January 07 2021 Chihuahua Crossbreed 5 M FALSE Brookl… 2021
## 7 January 10 2021 Beagle 9M M FALSE Brookl… 2021
## 8 January 09 2021 Pit Bull <NA> U FALSE Brookl… 2021
## 9 January 09 2021 Maltese 6 M FALSE Brookl… 2021
## 10 January 09 2021 MIXED BREED 3 F TRUE Brookl… 2021
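The year filter depends on `mdy()` understanding strings such as "January 03 2021". As a minimal sketch, the parse-then-extract step can be tried in isolation on a hand-made tibble. The `toy` data and the `BiteDate` column name are illustrative assumptions and are not part of the dataset.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical three-row sample in the same "Month DD YYYY" text format
toy <- tibble(DateOfBite = c("January 03 2021", "March 15 2015", "July 04 2021"))

toy_2021 <- toy %>%
  mutate(BiteDate = mdy(DateOfBite),   # character -> Date
         Year = year(BiteDate)) %>%    # Date -> numeric year
  filter(Year == 2021)                 # keep only the 2021 rows

toy_2021
```

Keeping the parsed `BiteDate` column is optional, but it makes later month or weekday summaries possible without parsing the text again.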
My goal is to compare dog bite incidents by breed and check whether there is a relationship between a dog's breed and its sterilization status.
1) Let’s find the number of incidents by breed.
# Data preparation for analysis
NYC_Dog_Bites2021 <- NYC_Dog_Bites2021 %>%
filter(Breed != "UNKNOWN") # remove the Unknown dogs breed
# Let's count the incidents by breed
DogsBites_by_breed <- NYC_Dog_Bites2021 %>%
group_by(Breed) %>%
summarise(n_incidents = n()) %>%
arrange(desc(n_incidents))
DogsBites_by_breed
## # A tibble: 330 × 2
## Breed n_incidents
## <chr> <int>
## 1 Pit Bull 521
## 2 MIXED 112
## 3 German Shepherd 85
## 4 Shih Tzu 58
## 5 Chihuahua 56
## 6 Rottweiler 53
## 7 Yorkshire Terrier 46
## 8 Maltese 41
## 9 MIXED BREED 38
## 10 Poodle, Standard 34
## # ℹ 320 more rows
This result shows that Pit Bull is the most frequently reported breed, with 521 dog bite cases registered in New York City in 2021, followed by Mixed and German Shepherd with 112 and 85 cases, respectively.
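The counts above treat "MIXED" and "MIXED BREED" as two different breeds, because `group_by(Breed)` matches labels exactly. One possible way to consolidate such labels before counting is sketched below. The `toy` rows, the `Breed_clean` column, and the rule that folds both spellings into "Mixed" are assumptions made for illustration; the real data would need its own review of labels.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical labels with inconsistent case and wording
toy <- tibble(Breed = c("MIXED", "MIXED BREED", "Pit Bull", "PIT BULL", "Mixed"))

toy_clean <- toy %>%
  mutate(
    Breed_clean = str_to_title(str_squish(Breed)),       # trim spaces, unify case
    Breed_clean = if_else(Breed_clean %in% c("Mixed", "Mixed Breed"),
                          "Mixed", Breed_clean)          # assumed merge rule
  )

breed_counts <- count(toy_clean, Breed_clean, sort = TRUE)
breed_counts
```

With this toy input, the five raw labels collapse into two groups, which is the effect one would look for before comparing breeds across years.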
2) Let’s find the relationship between the dog breed and whether or not the dog is spayed/neutered.
Breed_SpayNeuter <- NYC_Dog_Bites2021 %>%
count(Breed, SpayNeuter) %>%
group_by(Breed) %>%
mutate(percent = 100 * n / sum(n)) %>%
arrange(desc(n))
head(Breed_SpayNeuter,10)
## # A tibble: 10 × 4
## # Groups: Breed [9]
## Breed SpayNeuter n percent
## <chr> <lgl> <int> <dbl>
## 1 Pit Bull FALSE 488 93.7
## 2 MIXED FALSE 88 78.6
## 3 German Shepherd FALSE 68 80
## 4 Rottweiler FALSE 48 90.6
## 5 Chihuahua FALSE 46 82.1
## 6 Yorkshire Terrier FALSE 41 89.1
## 7 Shih Tzu FALSE 38 65.5
## 8 MIXED BREED FALSE 34 89.5
## 9 Pit Bull TRUE 33 6.33
## 10 Poodle, Standard FALSE 32 94.1
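Because `group_by(Breed)` comes before `mutate()`, each `percent` value is a share within its own breed, not a share of all incidents. The sketch below reproduces that arithmetic for the two Pit Bull rows, using the counts printed in the table above; the object names `counts` and `shares` are only illustrative.

```r
library(dplyr)
library(tibble)

# Pit Bull counts copied from the table above
counts <- tibble(
  Breed = c("Pit Bull", "Pit Bull"),
  SpayNeuter = c(FALSE, TRUE),
  n = c(488L, 33L)
)

shares <- counts %>%
  group_by(Breed) %>%
  mutate(percent = 100 * n / sum(n)) %>%  # sum(n) is the per-breed total, 521 here
  ungroup()

round(shares$percent, 1)  # 93.7 and 6.3
```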
We will generate a plot that shows the proportion of dog bite incidents by breed according to whether the dogs are spayed/neutered or not. Since too many breeds are registered, we will focus on the top 10 most frequently registered breeds.
# let's get the top 10 Breeds by count
top10_Breeds <- NYC_Dog_Bites2021 %>%
filter(!is.na(Breed),
str_to_lower(Breed) != "unknown",
!is.na(SpayNeuter)) %>%
add_count(Breed, name = "breed_total") %>%
filter(breed_total %in% tail(sort(unique(breed_total)), 10))
# Create plot
ggplot(data= top10_Breeds, aes(x = fct_reorder(Breed, Breed, length), y = 1, fill = SpayNeuter)) +
geom_col(position = "fill") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Spay/Neuter Status by Top 10 Dog Breeds in Bite Incidents",
x = "Dog Breed",
y = "Proportion of Incidents",
fill = "Spay/Neuter Status") +
theme_minimal()
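The filter `breed_total %in% tail(sort(unique(breed_total)), 10)` keeps every breed whose total is among the ten largest distinct totals, so tied breeds can push the result past ten. An alternative is sketched below on toy data, taking the top 2 instead of the top 10 for brevity. The `toy` tibble is hypothetical, and `with_ties = FALSE` is one possible choice that breaks ties arbitrarily.

```r
library(dplyr)
library(tibble)

# Hypothetical incidents: A has 3 rows, B and C have 2 each, D has 1
toy <- tibble(Breed = c("A", "A", "A", "B", "B", "C", "C", "D"))

top_breeds <- toy %>%
  count(Breed, name = "breed_total") %>%
  slice_max(breed_total, n = 2, with_ties = FALSE)  # exactly 2 breeds

# Keep only the incident rows that belong to the selected breeds
toy_top <- semi_join(toy, top_breeds, by = "Breed")
```

On this toy input the `tail(sort(unique(...)))` approach with 2 would keep A, B, and C (three breeds), while the sketch keeps exactly two.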
To compare the 2015 data with the 2021 data above, I filter to retrieve the 2015 NYC dog bites. TidyVerseEXTEND - Paula Brown
NYC_Dog_Bites2015 <- NYC_Dogs_Bites %>%
select(-UniqueID,-Species,-ZipCode) %>%
mutate(Year = year(mdy(DateOfBite))) %>%
filter(Year == 2015, Breed != "UNKNOWN")
We retrieve the top 10 breeds for the 2015 and 2021 dog bites in order to combine them. TidyVerseEXTEND - Paula Brown
top10_2015 <- NYC_Dog_Bites2015 %>%
count(Breed, name = "Incidents_2015") %>%
arrange(desc(Incidents_2015)) %>%
slice_max(Incidents_2015, n = 10)
top10_2021 <- NYC_Dog_Bites2021 %>%
count(Breed, name = "Incidents_2021") %>%
arrange(desc(Incidents_2021)) %>%
slice_max(Incidents_2021, n = 10)
Combine the data TidyVerseEXTEND - Paula Brown
top_breeds_combined <- full_join(top10_2015, top10_2021, by = "Breed") %>%
pivot_longer(cols = starts_with("Incidents"),
names_to = "Year",
values_to = "Incidents") %>%
mutate(Year = ifelse(Year == "Incidents_2015", "2015", "2021"))
Remove NAs and keep only rows with at least one incident. TidyVerseEXTEND - Paula Brown
top_breeds_combined_clean <- top_breeds_combined %>%
filter(!is.na(Breed), !is.na(Incidents), Incidents > 0)
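The NA filter is needed because `full_join()` keeps breeds that appear in only one year's top 10, leaving the other year's count missing. A small sketch with made-up counts shows where those NAs come from. The tibbles `y2015` and `y2021` are hypothetical, and `names_prefix` is used here as an alternative to the `ifelse()` relabeling step.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical top lists: only breed B appears in both years
y2015 <- tibble(Breed = c("A", "B"), Incidents_2015 = c(10L, 4L))
y2021 <- tibble(Breed = c("B", "C"), Incidents_2021 = c(6L, 3L))

long <- full_join(y2015, y2021, by = "Breed") %>%
  pivot_longer(starts_with("Incidents"),
               names_to = "Year",
               names_prefix = "Incidents_",   # strips the prefix, leaving "2015"/"2021"
               values_to = "Incidents")

# A has no 2021 count and C has no 2015 count, so two rows are NA
long_clean <- filter(long, !is.na(Incidents))
```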
Here’s a bar chart comparing 2015 and 2021 counts side by side TidyVerseEXTEND - Paula Brown
ggplot(top_breeds_combined_clean, aes(x = fct_reorder(Breed, Incidents, sum),
y = Incidents, fill = Year)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Top 10 Dog Breeds in Bite Incidents: 2015 vs 2021",
x = "Dog Breed",
y = "Number of Incidents",
fill = "Year"
) +
geom_text(aes(label = Incidents), position = position_dodge(width = 0.9), hjust = -0.1)+
theme_minimal()
### Interpretation of Top 10 Breeds in 2015 vs. Top 10 Breeds in 2021
New interpretation made by Paula Brown for TidyVerseEXTEND project
Significant Decrease
Pit Bull: Dropped from 521 to 64 incidents. This is a nearly 88% decrease, which could reflect changes in ownership, breed classification, public policy, or reporting practices.
Notable Increases
Chihuahua: Rose from 56 to 133 incidents, a 138% increase.
Shih Tzu: Jumped from 58 to 120 incidents, more than doubling.
Yorkshire Terrier and Maltese also saw substantial increases, suggesting a rise in small breed incidents.
Relatively Stable
Breeds such as German Shepherd and Rottweiler stayed comparatively stable. This stability might indicate consistent breed presence and behavior over time.
Note that the 2021 breed table above reports 521 Pit Bull, 58 Shih Tzu, and 56 Chihuahua incidents for 2021, so the 521, 58, and 56 quoted here as starting values are 2021 counts. The direction of these changes should be confirmed against the year legend of the chart.
DogsBites_by_breed2015 <- NYC_Dog_Bites2015 %>%
count(Breed, name = "n_incidents_2015")
DogsBites_by_breed2021 <- NYC_Dog_Bites2021 %>%
count(Breed, name = "n_incidents_2021")
Breed_comparison <- full_join(DogsBites_by_breed2015, DogsBites_by_breed2021, by = "Breed") %>%
replace_na(list(n_incidents_2015 = 0, n_incidents_2021 = 0)) %>%
mutate(
diff = n_incidents_2021 - n_incidents_2015,
percent_change = ifelse(n_incidents_2015 == 0, NA, (diff / n_incidents_2015) * 100)
)
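The percent change is computed as 100 × (2021 count − 2015 count) / 2015 count, and it is left as NA when the 2015 count is zero because the ratio is undefined. The sketch below checks this on three rows whose counts are taken from the table in this vignette, plus one hypothetical breed ("New breed") with no 2015 incidents.

```r
library(dplyr)
library(tibble)

counts <- tibble(
  Breed = c("Labradoodle", "MIXED BREED", "Golden Retriever", "New breed"),
  n_incidents_2015 = c(5, 1, 16, 0),   # last row is a made-up example
  n_incidents_2021 = c(9, 38, 22, 4)
)

changes <- counts %>%
  mutate(
    diff = n_incidents_2021 - n_incidents_2015,
    percent_change = if_else(n_incidents_2015 == 0,
                             NA_real_,                         # undefined baseline
                             100 * diff / n_incidents_2015)
  )

changes$percent_change  # 80, 3700, 37.5, NA
```

The 3700% for MIXED BREED comes from a baseline of a single incident, which is worth keeping in mind when ranking breeds by percent change.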
Here’s a quick table view of the top breeds with biggest increases/decreases TidyVerseEXTEND - Paula Brown
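Breed_comparison %>%
select(Breed, n_incidents_2015, n_incidents_2021, diff, percent_change) %>%
filter(!is.na(percent_change)) %>% # drop breeds with no 2015 incidents
mutate(percent_change = paste0(round(percent_change, 1), "%")) %>% # format as text, e.g. "80%"
arrange(desc(percent_change)) # caution: percent_change is text here, so this sort is alphabetical ("80%" before "400%"), not numeric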
## # A tibble: 440 × 5
## Breed n_incidents_2015 n_incidents_2021 diff percent_change
## <chr> <int> <int> <int> <chr>
## 1 Labradoodle 5 9 4 80%
## 2 TERRIER MIX 8 14 6 75%
## 3 COLLIE 3 5 2 66.7%
## 4 Belgian Malinois 4 6 2 50%
## 5 JACK RUSSELL MIX 2 3 1 50%
## 6 Rhodesian Ridgeback 2 3 1 50%
## 7 GOLDEN DOODLE 1 5 4 400%
## 8 HUSKY MIX 1 5 4 400%
## 9 MIXED BREED 1 38 37 3700%
## 10 Golden Retriever 16 22 6 37.5%
## # ℹ 430 more rows
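In the table above, "80%" is listed before "400%" and "3700%" because the column was converted to text before sorting. A minimal sketch of the difference follows, using four percent values from that table. The object names are illustrative, and ending the pipe with `head(10)` is one way to limit the rows.

```r
library(dplyr)
library(tibble)

pc <- tibble(
  Breed = c("Labradoodle", "GOLDEN DOODLE", "MIXED BREED", "Golden Retriever"),
  percent_change = c(80, 400, 3700, 37.5)
)

# Formatting first turns the column into text, so desc() compares characters
text_sorted <- pc %>%
  mutate(percent_change = paste0(round(percent_change, 1), "%")) %>%
  arrange(desc(percent_change))

# Sorting on the numeric column first, then formatting, keeps numeric order
num_sorted <- pc %>%
  arrange(desc(percent_change)) %>%
  mutate(percent_change = paste0(round(percent_change, 1), "%")) %>%
  head(10)
```

Here `text_sorted` starts with Labradoodle ("80%"), while `num_sorted` starts with MIXED BREED ("3700%").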
Top 10 breeds with the largest percent increases in bite incidents from 2015 to 2021 TidyVerseEXTEND - Paula Brown
# Build the top 10 from the Breed_comparison table
top_increases <- Breed_comparison %>%
filter(!is.na(percent_change)) %>% # remove NA values
arrange(desc(percent_change)) %>%
slice_head(n = 10) %>% # top 10 increases
mutate(percent_label = paste0(round(percent_change, 1), "%"))
# Plot
ggplot(top_increases, aes(x = fct_reorder(Breed, percent_change),
y = percent_change)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = percent_label),
hjust = -0.1, size = 3.5) +
coord_flip() +
labs(
title = "Top 10 Dog Breeds by Increase in Bite Incidents (2015–2021)",
x = "Dog Breed",
y = "Percent Change"
) +
theme_minimal()
### Interpretation of Top Dog Breeds by Increase in Bite Incidents (2015-2021)
New interpretation made by Paula Brown for TidyVerseEXTEND project
Huge Growth in Incidents
Mixed Breed: A staggering 3700% increase, from 1 incident in 2015 to 38 in 2021.
Mixed: Jumped 1500%, suggesting a major rise in bites from dogs labeled as mixed breed.
Pitbull Mix: Up 1100%, reinforcing the trend seen in mixed classifications.
These massive jumps likely reflect:
Increased ownership of mixed breeds
Changes in breed classification or reporting
Possibly greater public awareness or documentation of incidents
Moderate but Significant Increases
Golden Doodle and Husky Mix: Both rose 400%, from 1 to 5 incidents.
Dachshund, Poodle X, Poodle Mix, Sheperd, and Corgi: All saw increases between 233% and 300%.
Overall, the Spay/Neuter plot shows that spayed or neutered dogs account for a smaller share of the recorded dog bite incidents in New York City than dogs that are not. In fact, for only 2 of the top 10 breeds by count do spayed or neutered dogs make up more than 25% of incidents.
New Conclusion made by Paula Brown for TidyVerseEXTEND project
Between 2015 and 2021, dog bite incidents in NYC shifted significantly in both breed composition and incident frequency:
Large Breeds Declined
Pit Bull bites dropped dramatically (from 521 to 64), suggesting reduced ownership, improved management, or changes in breed classification.
Other traditionally high-incident breeds like German Shepherd and Rottweiler remained stable or declined.
Small and Mixed Breeds Surged
Breeds like Chihuahua, Shih Tzu, Yorkshire Terrier, and Maltese saw sharp increases in raw incident counts.
Percent change analysis revealed explosive growth in Mixed, Pitbull Mix, Golden Doodle, and Corgi bites, with some increasing over 1000%.