options(repos = c(CRAN = "https://cran.rstudio.com"))

Introduction

To create this Tidyverse vignette, I use a Kaggle dataset of all dog bites registered in New York City from 2015 to 2021.

Load the required package

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Let’s read the Data from GitHub

NYC_Dogs_Bites <- read_csv("https://raw.githubusercontent.com/Pascaltafo2025/DATA-607/main/Dog_Bites_Data.csv")
## Rows: 22663 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): DateOfBite, Species, Breed, Age, Gender, Borough, ZipCode
## dbl (1): UniqueID
## lgl (1): SpayNeuter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NYC_Dogs_Bites)
## # A tibble: 6 × 9
##   UniqueID DateOfBite      Species Breed Age   Gender SpayNeuter Borough ZipCode
##      <dbl> <chr>           <chr>   <chr> <chr> <chr>  <lgl>      <chr>   <chr>  
## 1        1 January 01 2018 DOG     UNKN… <NA>  U      FALSE      Brookl… 11220  
## 2        2 January 04 2018 DOG     UNKN… <NA>  U      FALSE      Brookl… <NA>   
## 3        3 January 06 2018 DOG     Pit … <NA>  U      FALSE      Brookl… 11224  
## 4        4 January 08 2018 DOG     Mixe… 4     M      FALSE      Brookl… 11231  
## 5        5 January 09 2018 DOG     Pit … <NA>  U      FALSE      Brookl… 11224  
## 6        6 January 03 2018 DOG     BASE… 4Y    M      FALSE      Brookl… 11231
colnames(NYC_Dogs_Bites)
## [1] "UniqueID"   "DateOfBite" "Species"    "Breed"      "Age"       
## [6] "Gender"     "SpayNeuter" "Borough"    "ZipCode"
sort(unique(year(mdy(NYC_Dogs_Bites$DateOfBite))))
## [1] 2015 2016 2017 2018 2019 2020 2021

The dataset is large (22,663 rows) because it covers dog bites in NYC from 2015 to 2021, so I restrict my analysis to the year 2021.

Data Cleaning

In this section, I drop the column(s) that I believe are not relevant to the analysis and keep only the records for my year of interest.

library(lubridate)

# First cleanup: drop the unnecessary columns and keep only the year of interest.

NYC_Dog_Bites2021 <- NYC_Dogs_Bites %>%
  select(-UniqueID,-Species,-ZipCode) %>% # remove the UniqueID, Species and ZipCode columns
  mutate(Year = year(mdy(DateOfBite))) %>%  # parse the "Month DD YYYY" text and extract the year
  filter(Year == 2021)  # keep only year 2021

head(NYC_Dog_Bites2021,10)
## # A tibble: 10 × 7
##    DateOfBite      Breed                   Age   Gender SpayNeuter Borough  Year
##    <chr>           <chr>                   <chr> <chr>  <lgl>      <chr>   <dbl>
##  1 January 03 2021 FRENCH BULL DOG & PIT … 4     F      FALSE      Brookl…  2021
##  2 January 02 2021 UNKNOWN                 <NA>  U      FALSE      Brookl…  2021
##  3 January 05 2021 Rhodesian Ridgeback     10    M      FALSE      Brookl…  2021
##  4 January 01 2021 MIXED BREED             5     F      FALSE      Brookl…  2021
##  5 January 03 2021 MIXED                   5     M      TRUE       Brookl…  2021
##  6 January 07 2021 Chihuahua Crossbreed    5     M      FALSE      Brookl…  2021
##  7 January 10 2021 Beagle                  9M    M      FALSE      Brookl…  2021
##  8 January 09 2021 Pit Bull                <NA>  U      FALSE      Brookl…  2021
##  9 January 09 2021 Maltese                 6     M      FALSE      Brookl…  2021
## 10 January 09 2021 MIXED BREED             3     F      TRUE       Brookl…  2021
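
The year filter depends on lubridate turning the text dates into real dates first. As a minimal sketch of that step, assuming a hypothetical toy table (`toy` and its rows are made up for illustration) whose `DateOfBite` strings follow the same "Month DD YYYY" layout:

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy rows; only the date layout mirrors the real data
toy <- tibble(
  DateOfBite = c("January 01 2018", "March 15 2021",
                 "December 31 2015", "July 04 2021"),
  Breed = c("Beagle", "Maltese", "Poodle", "Pug")
)

toy_2021 <- toy %>%
  mutate(BiteDate = mdy(DateOfBite),  # parse text into a Date
         Year = year(BiteDate)) %>%   # extract the calendar year
  filter(Year == 2021)                # keep only the 2021 rows

toy_2021
```

Keeping the parsed `BiteDate` column is optional; it is shown here only because it makes later month or weekday summaries possible without parsing again.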

DATA ANALYSIS

My goal is to compare dog bite incidents by breed and to check whether there is a relationship between a dog's breed and its sterilization (spay/neuter) status.

1) Let’s find the number of incidents by breed.

# Data preparation for analysis

NYC_Dog_Bites2021 <- NYC_Dog_Bites2021 %>%
  filter(Breed != "UNKNOWN")  # drop breeds recorded as UNKNOWN (rows with a missing Breed are dropped too)

# Let's count the incidents by breed

DogsBites_by_breed <- NYC_Dog_Bites2021 %>%
  group_by(Breed) %>%
  summarise(n_incidents = n()) %>%
  arrange(desc(n_incidents))

DogsBites_by_breed
## # A tibble: 330 × 2
##    Breed             n_incidents
##    <chr>                   <int>
##  1 Pit Bull                  521
##  2 MIXED                     112
##  3 German Shepherd            85
##  4 Shih Tzu                   58
##  5 Chihuahua                  56
##  6 Rottweiler                 53
##  7 Yorkshire Terrier          46
##  8 Maltese                    41
##  9 MIXED BREED                38
## 10 Poodle, Standard           34
## # ℹ 320 more rows

This result shows that Pit Bull is the most frequently reported breed, with 521 dog bite cases registered in New York City in 2021, followed by MIXED (112 cases) and German Shepherd (85 cases).
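
The breed labels in this dataset mix upper case and title case, so the same breed may be counted under more than one spelling. The sketch below shows one possible way to unify case and whitespace before counting. The toy labels and the `Breed_clean` column are hypothetical, and whether labels such as "MIXED" and "MIXED BREED" should also be merged is a separate judgment call that this sketch does not make.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical labels illustrating inconsistent casing and stray spaces
toy <- tibble(Breed = c("Pit Bull", "PIT BULL", "MIXED", "Mixed ", "Beagle"))

breed_counts <- toy %>%
  mutate(Breed_clean = str_to_title(str_squish(Breed))) %>%  # trim spaces, unify case
  count(Breed_clean, name = "n_incidents", sort = TRUE)      # count per cleaned label

breed_counts
```

After cleaning, the five raw labels collapse into three breeds, which is the effect one would hope for on the real `Breed` column.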

2) Let’s find the relationship between the dog breed and whether or not the dog is spayed or neutered.

Breed_SpayNeuter <- NYC_Dog_Bites2021 %>%
  count(Breed, SpayNeuter) %>%
  group_by(Breed) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  arrange(desc(n))

head(Breed_SpayNeuter,10)
## # A tibble: 10 × 4
## # Groups:   Breed [9]
##    Breed             SpayNeuter     n percent
##    <chr>             <lgl>      <int>   <dbl>
##  1 Pit Bull          FALSE        488   93.7 
##  2 MIXED             FALSE         88   78.6 
##  3 German Shepherd   FALSE         68   80   
##  4 Rottweiler        FALSE         48   90.6 
##  5 Chihuahua         FALSE         46   82.1 
##  6 Yorkshire Terrier FALSE         41   89.1 
##  7 Shih Tzu          FALSE         38   65.5 
##  8 MIXED BREED       FALSE         34   89.5 
##  9 Pit Bull          TRUE          33    6.33
## 10 Poodle, Standard  FALSE         32   94.1
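
To make the `percent` column easier to read: it is the share of each spay/neuter status within a breed, so the rows of one breed add up to 100. A minimal sketch on hypothetical toy data (breeds "A" and "B" are placeholders):

```r
library(dplyr)
library(tibble)

# Hypothetical toy incidents: breed A has 4 rows, breed B has 2
toy <- tibble(
  Breed      = c("A", "A", "A", "A", "B", "B"),
  SpayNeuter = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

shares <- toy %>%
  count(Breed, SpayNeuter) %>%            # incidents per breed and status
  group_by(Breed) %>%
  mutate(percent = 100 * n / sum(n)) %>%  # share within each breed
  ungroup()

shares
# Breed A: 75% FALSE, 25% TRUE; breed B: 50% FALSE, 50% TRUE
```

Calling `ungroup()` at the end is a design choice rather than a requirement; it avoids surprises if the result is summarised again later.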

DATA VISUALIZATION

We will generate a plot showing the proportion of dog bite incidents by breed, split by whether the dog was spayed/neutered or not. Since too many breeds are registered, we will focus on the 10 most frequently registered breeds.

# let's get the top 10 Breeds by count

top10_Breeds <- NYC_Dog_Bites2021 %>%
                filter(!is.na(Breed), 
                str_to_lower(Breed) != "unknown",
                !is.na(SpayNeuter)) %>%
                add_count(Breed, name = "breed_total") %>%
                filter(breed_total %in% tail(sort(unique(breed_total)), 10))

# Create plot

ggplot(data= top10_Breeds, aes(x = fct_reorder(Breed, Breed, length), y = 1, fill = SpayNeuter)) +
  geom_col(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Spay/Neuter Status by Top 10 Dog Breeds in Bite Incidents",
    x = "Dog Breed",
    y = "Proportion of Incidents",
    fill = "Spay/Neuter Status") +
  theme_minimal()
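
One detail of the top-10 filter: it keeps every breed whose total is among the ten largest distinct totals, so it can return more than ten breeds when several breeds share the same count. If exactly N breeds are wanted, one alternative is to rank the breeds first and then keep only their rows. This is a sketch on hypothetical toy data with N = 2, not the method used above:

```r
library(dplyr)
library(tibble)

# Hypothetical toy incidents: A has 5 rows, B and C have 3 each, D has 1
toy <- tibble(Breed = c(rep("A", 5), rep("B", 3), rep("C", 3), "D"))

top_breeds <- toy %>%
  count(Breed, name = "breed_total") %>%
  slice_max(breed_total, n = 2, with_ties = FALSE)  # exactly 2 breeds, ties cut off

# Keep only the incident rows that belong to the selected breeds
toy_top <- toy %>%
  semi_join(top_breeds, by = "Breed")

nrow(top_breeds)
```

With `with_ties = FALSE` the cut between tied breeds is arbitrary (it follows row order), so it is worth checking which breeds sit at the boundary before drawing conclusions from the plot.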

Compare 2021 dog bite data to 2015 dog bite data

To compare 2015 with 2021, I first filter the full dataset down to the 2015 NYC dog bites, applying the same cleaning steps used for 2021. (TidyVerseEXTEND - Paula Brown)

NYC_Dog_Bites2015 <- NYC_Dogs_Bites %>%
  select(-UniqueID,-Species,-ZipCode) %>%
  mutate(Year = year(mdy(DateOfBite))) %>%
  filter(Year == 2015, Breed != "UNKNOWN")

Get Top10 Breeds

We retrieve the top 10 breeds for 2015 and for 2021 so that the two lists can be combined. (TidyVerseEXTEND - Paula Brown)

top10_2015 <- NYC_Dog_Bites2015 %>%
  count(Breed, name = "Incidents_2015") %>%
  arrange(desc(Incidents_2015)) %>%
  slice_max(Incidents_2015, n = 10)

top10_2021 <- NYC_Dog_Bites2021 %>%
  count(Breed, name = "Incidents_2021") %>%
  arrange(desc(Incidents_2021)) %>%
  slice_max(Incidents_2021, n = 10)

Merge the datasets

Combine the data. (TidyVerseEXTEND - Paula Brown)

top_breeds_combined <- full_join(top10_2015, top10_2021, by = "Breed") %>%
  pivot_longer(cols = starts_with("Incidents"),
               names_to = "Year",
               values_to = "Incidents") %>%
  mutate(Year = ifelse(Year == "Incidents_2015", "2015", "2021"))
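
To illustrate what the join and reshape produce, here is a minimal sketch on two hypothetical toy tables (`y2015`, `y2021` and their counts are made up). It uses the `names_prefix` argument of `pivot_longer()` as an alternative to recoding the year with `ifelse()`:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical per-year top lists; breed B appears in both years
y2015 <- tibble(Breed = c("A", "B"), Incidents_2015 = c(10L, 4L))
y2021 <- tibble(Breed = c("B", "C"), Incidents_2021 = c(6L, 9L))

long <- full_join(y2015, y2021, by = "Breed") %>%
  pivot_longer(cols = starts_with("Incidents"),
               names_to = "Year",
               names_prefix = "Incidents_",  # strip the prefix, leaving "2015" / "2021"
               values_to = "Incidents")

long
```

Note that an `NA` in `Incidents` here only means the breed was absent from the other year's top list. It does not mean the breed had zero incidents that year.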

Remove NAs and keep only rows with at least one incident. (TidyVerseEXTEND - Paula Brown)

top_breeds_combined_clean <- top_breeds_combined %>%
  filter(!is.na(Breed), !is.na(Incidents), Incidents > 0)

Visualize the difference

Here’s a bar chart comparing 2015 and 2021 counts side by side. (TidyVerseEXTEND - Paula Brown)

ggplot(top_breeds_combined_clean, aes(x = fct_reorder(Breed, Incidents, sum), 
                                y = Incidents, fill = Year)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Top 10 Dog Breeds in Bite Incidents: 2015 vs 2021",
    x = "Dog Breed",
    y = "Number of Incidents",
    fill = "Year"
  ) +
  geom_text(aes(label = Incidents), position = position_dodge(width = 0.9), hjust = -0.1)+
  theme_minimal()

Interpretation of Top 10 Breeds in 2015 vs. Top 10 Breeds in 2021

New interpretation made by Paula Brown for TidyVerseEXTEND project

Significant Decrease

  • Pit Bull: Dropped dramatically from 521 incidents in 2015 to just 64 in 2021.

This is a nearly 88% decrease, which could reflect changes in ownership, breed classification, public policy, or reporting practices.

Notable Increases

  • Chihuahua: Rose from 56 to 133 incidents — a 138% increase.

  • Shih Tzu: Jumped from 58 to 120 incidents, more than doubling.

  • Yorkshire Terrier and Maltese also saw substantial increases, suggesting a rise in small breed incidents.

Relatively Stable

  • German Shepherd: Stayed nearly constant (85 → 87).

This stability might indicate consistent breed presence and behavior over time.

Count Incidents by Breed

DogsBites_by_breed2015 <- NYC_Dog_Bites2015 %>%
  count(Breed, name = "n_incidents_2015")

DogsBites_by_breed2021 <- NYC_Dog_Bites2021 %>%
  count(Breed, name = "n_incidents_2021")

Compare Breed

Breed_comparison <- full_join(DogsBites_by_breed2015, DogsBites_by_breed2021, by = "Breed") %>%
  replace_na(list(n_incidents_2015 = 0, n_incidents_2021 = 0)) %>%
  mutate(
    diff = n_incidents_2021 - n_incidents_2015,
    percent_change = ifelse(n_incidents_2015 == 0, NA, (diff / n_incidents_2015) * 100)
  )
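
The percent change is the difference divided by the 2015 count, and it is undefined when a breed had no incidents in 2015. A minimal sketch of the same calculation follows. The breed names are placeholders, the counts 1 to 38 and 16 to 22 are taken from rows of this dataset, and the zero-baseline row is hypothetical:

```r
library(dplyr)
library(tibble)

cmp <- tibble(
  Breed  = c("A", "B", "C"),
  n_2015 = c(1, 0, 16),   # B is a hypothetical breed with no 2015 incidents
  n_2021 = c(38, 5, 22)
) %>%
  mutate(
    diff = n_2021 - n_2015,
    # A zero baseline has no meaningful percent change, so mark it as NA
    percent_change = if_else(n_2015 == 0, NA_real_, 100 * diff / n_2015)
  )

cmp
# A: (38 - 1) / 1 * 100 = 3700; B: NA; C: (22 - 16) / 16 * 100 = 37.5
```

Very small baselines, such as a single incident in 2015, produce very large percentages, so the raw counts are worth reading alongside the percent change.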

Highlight Percent Differences

Here’s a quick table view of the breeds with the biggest percent increases. (TidyVerseEXTEND - Paula Brown)

Breed_comparison %>%
  select(Breed, n_incidents_2015, n_incidents_2021, diff, percent_change) %>%
  filter(!is.na(percent_change)) %>%
  arrange(desc(percent_change)) %>%  # sort while percent_change is still numeric
  mutate(percent_change = paste0(round(percent_change, 1), "%")) %>%  # format as text only after sorting
  head(10)
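
A column formatted with `paste0()` is text, and `arrange()` on text sorts alphabetically, so "80%" can end up ahead of "3700%". Sorting on the numeric column and adding the label afterwards avoids this. A minimal sketch on toy rows (breed names and values are illustrative):

```r
library(dplyr)
library(tibble)

toy <- tibble(
  Breed          = c("Labradoodle", "GOLDEN DOODLE", "MIXED BREED", "Golden Retriever"),
  percent_change = c(80, 400, 3700, 37.5)
)

ranked <- toy %>%
  arrange(desc(percent_change)) %>%  # numeric sort: 3700, 400, 80, 37.5
  mutate(percent_label = paste0(round(percent_change, 1), "%"))  # label after sorting

ranked
```

Keeping the numeric column and storing the formatted text in a separate `percent_label` column also means the values stay usable for later plots or filters.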

Top 10 Breeds with Largest Percent Increase

Top 10 breeds with the largest percent increases in bite incidents from 2015 to 2021. (TidyVerseEXTEND - Paula Brown)

# Rank breeds by percent change, using the Breed_comparison table built above
top_increases <- Breed_comparison %>%
  filter(!is.na(percent_change)) %>%                # remove NA values
  arrange(desc(percent_change)) %>%
  slice_head(n = 10) %>%                            # top 10 increases
  mutate(percent_label = paste0(round(percent_change, 1), "%"))

# Plot
ggplot(top_increases, aes(x = fct_reorder(Breed, percent_change),
                          y = percent_change)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = percent_label),
            hjust = -0.1, size = 3.5) +
  coord_flip() +
  labs(
    title = "Top 10 Dog Breeds by Increase in Bite Incidents (2015–2021)",
    x = "Dog Breed",
    y = "Percent Change"
  ) +
  theme_minimal()

Interpretation of Top Dog Breeds by Increase in Bite Incidents (2015-2021)

New interpretation made by Paula Brown for TidyVerseEXTEND project

Huge Growth in Incidents

  • Mixed Breed: A staggering 3700% increase — from 1 incident in 2015 to 38 in 2021.

  • Mixed: Jumped 1500%, suggesting a major rise in bites from dogs labeled as mixed breed.

  • Pitbull Mix: Up 1100%, reinforcing the trend seen in mixed classifications.

These massive jumps likely reflect:

  • Increased ownership of mixed breeds

  • Changes in breed classification or reporting

  • Possibly greater public awareness or documentation of incidents

Moderate but Significant Increases

  • Golden Doodle and Husky Mix: Both rose 400%, from 1 to 5 incidents.

  • Dachshund, Poodle X, Poodle Mix, Sheperd, and Corgi: All saw increases between 233% and 300%.

INTERPRETATION

Overall, the spay/neuter plot shows that spayed or neutered dogs account for a smaller share of reported dog bite incidents in New York City than dogs that are not. In fact, only 2 of the top 10 breeds by count have a spayed/neutered share of incidents above 25%.

Overall Conclusion

New Conclusion made by Paula Brown for TidyVerseEXTEND project

Between 2015 and 2021, dog bite incidents in NYC shifted significantly in both breed composition and incident frequency:

Large Breeds Declined

Small and Mixed Breeds Surged