NYC Crash Data

NYPD Motor Vehicle Collisions - Crashes (2013-2020)

The dataset used for the following series of visualizations was downloaded from https://opendata.cityofnewyork.us/data/. As the column listing below shows, it contains many variables related to car accidents in New York City's five boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island). To keep the analysis focused, these visualizations draw on the columns CRASH.DATE, BOROUGH, LATITUDE/LONGITUDE, NUMBER.OF.PERSONS.INJURED, NUMBER.OF.PERSONS.KILLED, and CONTRIBUTING.FACTOR.VEHICLE.1. Because the dataset has a real-world bearing on people's safety, the analysis concentrates on location and on the causes of accidents, which are the most useful for identifying particularly dangerous areas in NYC.

The following visualizations seek to answer:

Where is the largest concentration of car crashes in NYC?
Is the time of year a determinant of the accidents in these areas?
What are the main causes of these crashes?
Are any causes unique to a location?
Where do the most dangerous collisions occur, and why?

As these questions show, the analysis focuses heavily on location. This reflects the assumption that some roads, and some broader regions, are more dangerous than others in terms of traffic. Identifying these regions, and eventually more specific locations, can help prevent future collisions. This is a preliminary analysis meant to support further examination of these locations, their traffic control measures, and their road structures, in the hope that these collisions are not purely chance events and are preventable to a degree.

filename <- read.csv("/Users/dylanwilsberg/CrashData/Motor_Vehicle_Crash.csv")
df <- filename

colnames(df)

##  [1] "CRASH.DATE"                    "CRASH.TIME"                   
##  [3] "BOROUGH"                       "ZIP.CODE"                     
##  [5] "LATITUDE"                      "LONGITUDE"                    
##  [7] "LOCATION"                      "ON.STREET.NAME"               
##  [9] "CROSS.STREET.NAME"             "OFF.STREET.NAME"              
## [11] "NUMBER.OF.PERSONS.INJURED"     "NUMBER.OF.PERSONS.KILLED"     
## [13] "NUMBER.OF.PEDESTRIANS.INJURED" "NUMBER.OF.PEDESTRIANS.KILLED" 
## [15] "NUMBER.OF.CYCLIST.INJURED"     "NUMBER.OF.CYCLIST.KILLED"     
## [17] "NUMBER.OF.MOTORIST.INJURED"    "NUMBER.OF.MOTORIST.KILLED"    
## [19] "CONTRIBUTING.FACTOR.VEHICLE.1" "CONTRIBUTING.FACTOR.VEHICLE.2"
## [21] "CONTRIBUTING.FACTOR.VEHICLE.3" "CONTRIBUTING.FACTOR.VEHICLE.4"
## [23] "CONTRIBUTING.FACTOR.VEHICLE.5" "COLLISION_ID"                 
## [25] "VEHICLE.TYPE.CODE.1"           "VEHICLE.TYPE.CODE.2"          
## [27] "VEHICLE.TYPE.CODE.3"           "VEHICLE.TYPE.CODE.4"          
## [29] "VEHICLE.TYPE.CODE.5"

library(leaflet)
library(ggplot2)
library(dplyr)
library(lubridate)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(cowplot)
# Parse CRASH.DATE (month/day/year strings) into Date objects
x <- mdy(df$CRASH.DATE)


df$YEAR <- year(x)
df$MONTH <- month(x)
df$MONTHname <- months(x, abbreviate = TRUE)

#MULTI BAR SETUP

Borough_df <- df %>%
  select(BOROUGH, YEAR) %>%
  group_by(YEAR, BOROUGH) %>%
  summarise(n = length(BOROUGH), .groups='keep') %>%
  data.frame()

Borough_bar1 <- Borough_df %>% filter(BOROUGH %in% c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")) %>% data.frame()

Borough_bar2 <- Borough_bar1 %>% filter(YEAR %in% c("2017", "2018", "2019", "2020")) %>% data.frame()

Borough_bar2$YEAR <- factor(Borough_bar2$YEAR)

#LINE CHART SETUP

Injuries_df <- df %>%
  select(BOROUGH, YEAR, MONTH) %>%
  group_by(YEAR, MONTH, BOROUGH) %>%
  summarise(n = length(MONTH), .groups='keep') %>%
  data.frame()

Injuries_df$YEARMONTH <- paste(Injuries_df$YEAR, Injuries_df$MONTH, sep = '_')

Injuriesfilter1 <- Injuries_df %>% filter(BOROUGH %in% c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")) %>% data.frame()

Injuriesfilter2 <- Injuriesfilter1 %>% filter(YEAR %in% c("2017", "2018", "2019", "2020")) %>% data.frame()

# Order the YEAR_MONTH keys chronologically (2017_1 ... 2020_12) rather than alphabetically
yearmonth_order <- factor(Injuriesfilter2$YEARMONTH,
                          levels = paste(rep(2017:2020, each = 12), rep(1:12, times = 4), sep = "_"))


new_x_labels <- c("2017","","","","","","","","","","","","2018","","","","","","","","","","","","2019","","","","","","","","","","","","2020","","","","","","","","","","","","2021")

#BAR CHART OF DEATH FACTORS SETUP

Factors_df <- df %>% filter(YEAR %in% c('2016','2017','2018','2019','2020')) %>% 
  filter(NUMBER.OF.PERSONS.KILLED > 0) %>% 
  filter(CONTRIBUTING.FACTOR.VEHICLE.1 !='Unspecified') %>% 
  filter(CONTRIBUTING.FACTOR.VEHICLE.1 != "") %>%
  filter(CONTRIBUTING.FACTOR.VEHICLE.1 %in% c('Unsafe Speed','Driver Inattention/Distraction','Failure to Yield Right-of-Way', 'Traffic Control Disregarded', 
                                              'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion','Alcohol Involvement','Illness','Backing Unsafely', 
                                              'Driver Inexperience','Lost Consciousness','Passing or Lane Usage Improper')) %>%
  group_by(CONTRIBUTING.FACTOR.VEHICLE.1) %>%
  summarize(count = n()) %>%
  arrange(count) 



#MAP PLOT SETUP/FILTERS

Map_df <- df %>% select(NUMBER.OF.PERSONS.INJURED, BOROUGH, YEAR, MONTH, LATITUDE, LONGITUDE) %>%
  filter(NUMBER.OF.PERSONS.INJURED == 1) %>%
  filter(BOROUGH %in% c('MANHATTAN', 'QUEENS', 'BROOKLYN', 'BRONX', 'STATEN ISLAND')) %>%
  filter(YEAR %in% c('2019')) %>%
  filter(MONTH %in% c('6')) %>% data.frame()

Map_df1 <- na.omit(Map_df)

Map_df2 <- df %>% select(NUMBER.OF.PERSONS.INJURED, BOROUGH, YEAR, MONTH, LATITUDE, LONGITUDE) %>%
  filter(NUMBER.OF.PERSONS.INJURED == 2) %>%
  filter(BOROUGH %in% c('MANHATTAN', 'QUEENS', 'BROOKLYN', 'BRONX', 'STATEN ISLAND')) %>%
  filter(YEAR %in% c('2019')) %>%
  filter(MONTH %in% c('6')) %>% data.frame()

Map_df3 <- df %>% select(NUMBER.OF.PERSONS.INJURED, BOROUGH, YEAR, MONTH, LATITUDE, LONGITUDE) %>%
  filter(NUMBER.OF.PERSONS.INJURED == 3) %>%
  filter(BOROUGH %in% c('MANHATTAN', 'QUEENS', 'BROOKLYN', 'BRONX', 'STATEN ISLAND')) %>%
  filter(YEAR %in% c('2019')) %>%
  filter(MONTH %in% c('6')) %>% data.frame()

Map_df4 <- df %>% select(NUMBER.OF.PERSONS.INJURED, BOROUGH, YEAR, MONTH, LATITUDE, LONGITUDE) %>%
  filter(NUMBER.OF.PERSONS.INJURED %in% 4:6) %>%  # %in% tests membership; == 4:6 would recycle 4,5,6 row by row
  filter(BOROUGH %in% c('MANHATTAN', 'QUEENS', 'BROOKLYN', 'BRONX', 'STATEN ISLAND')) %>%
  filter(YEAR %in% c('2019')) %>%
  filter(MONTH %in% c('6')) %>% data.frame()

Map_df5 <- df %>% select(NUMBER.OF.PERSONS.INJURED, BOROUGH, YEAR, MONTH, LATITUDE, LONGITUDE) %>%
  filter(NUMBER.OF.PERSONS.INJURED > 6) %>%
  filter(BOROUGH %in% c('MANHATTAN', 'QUEENS', 'BROOKLYN', 'BRONX', 'STATEN ISLAND')) %>%
  filter(YEAR %in% c('2019')) %>%
  filter(MONTH %in% c('6')) %>% data.frame()

#HORIZONTAL BAR SETUP

stack_df <- df %>%
  select(BOROUGH, CONTRIBUTING.FACTOR.VEHICLE.1, NUMBER.OF.PERSONS.INJURED) %>%
  filter(BOROUGH %in% c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")) %>%
  filter(NUMBER.OF.PERSONS.INJURED >= 1) %>%
  filter(CONTRIBUTING.FACTOR.VEHICLE.1 %in% c('Unsafe Speed', 'Traffic Control Disregarded', 
                                              'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion','Alcohol Involvement','Aggressive Driving/Road Rage','Backing Unsafely', 
                                              'Driver Inexperience','Lost Consciousness','Passing or Lane Usage Improper')) %>%
  group_by(CONTRIBUTING.FACTOR.VEHICLE.1, BOROUGH) %>%
  summarise(n = length(NUMBER.OF.PERSONS.INJURED), .groups = 'keep') %>% data.frame()
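
The 4-to-6 injury bucket in the map filters depends on the difference between `==` and `%in%` in R: `==` compares element by element and recycles the shorter vector, while `%in%` tests set membership. The following is a minimal sketch of that difference, using a made-up vector `injured` that is not taken from the dataset.

```r
# Hypothetical injury counts, one per crash (illustrative only)
injured <- c(4, 4, 4, 5, 5, 5, 6, 6, 6)

# `==` recycles 4:6 as 4,5,6,4,5,6,... and compares position by position
recycled <- injured == 4:6
sum(recycled)   # counts only rows that happen to line up with the recycled value

# `%in%` asks whether each value belongs to the set {4, 5, 6}
member <- injured %in% 4:6
sum(member)     # counts every crash with 4, 5, or 6 injuries
```

In this toy case the recycled comparison keeps 3 of the 9 rows, while the membership test keeps all 9.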

Multi-Bar Chart: Crashes by Borough by Year

This first graph shows the count of car accidents in each NYC borough. A multi-bar chart splits the data into the four most recent years (2017-2020). Across these four years, each borough's share of crashes remained fairly constant. Brooklyn and Queens led the group, while Staten Island recorded the fewest crashes, possibly because it receives less overall traffic. In 2020, all boroughs saw significant decreases in accidents, most likely as a result of the pandemic.

plot1 <- ggplot(Borough_bar2, aes(x = BOROUGH, y = n, fill=BOROUGH)) +
  geom_bar(stat="identity", position="dodge") +
  scale_fill_brewer(palette = "Spectral") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Crashes by Borough by Year",
       x = "Borough", 
       y = "Crash Count",
       fill = "Borough") +
  facet_wrap(~YEAR, ncol = 2, nrow = 2)
plot1
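
The observation that borough proportions stay roughly constant could be checked numerically by converting counts to within-year shares. The sketch below uses base R and a hypothetical data frame `toy` with invented counts. It only illustrates the calculation and does not reflect results from the real data.

```r
# Hypothetical yearly crash counts per borough (illustrative numbers only)
toy <- data.frame(
  YEAR    = rep(c(2019, 2020), each = 3),
  BOROUGH = rep(c("BROOKLYN", "QUEENS", "STATEN ISLAND"), times = 2),
  n       = c(500, 400, 100, 250, 200, 50)
)

# Share of each year's crashes attributed to each borough
toy$share <- toy$n / ave(toy$n, toy$YEAR, FUN = sum)
toy
```

In this toy case the totals halve from one year to the next, yet every borough keeps the same share, which is the pattern the chart suggests.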

Line-Chart: Crashes by Borough by Month/Year

Here we look at the boroughs through a line chart. Although similar in nature to the previous plot, it shows how crash counts change over time. A semi-cyclic pattern emerges, with peaks tending to occur around May and June. As in the last plot, Staten Island stays fairly constant with fewer accidents, which is consistent with its lower overall traffic. The large dip in 2020 marks the beginning of COVID-19. One might expect more accidents in the winter months, when road conditions are more treacherous, so the low winter counts come as a surprise. This suggests that winter conditions play a smaller role than expected, possibly because fewer people are on the roads during those months. The correlation among the boroughs, with the exception of Staten Island, is also striking.

plot2 <- ggplot(Injuriesfilter2, aes(x = yearmonth_order, y = n, group=BOROUGH)) +
  geom_line(aes(color=BOROUGH), size=1.5) +
  labs(title = "Crashes by Borough by Month",
      x = "Date",
      y = "Crash Count",
      fill = "Borough") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Spectral", name = "Borough") +
  scale_x_discrete(labels= new_x_labels)
plot2
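
One way to test the seasonal reading of the line chart is to pool the years and average the counts by calendar month. This is a minimal base R sketch with a hypothetical data frame `toy` whose counts are invented for illustration, so the peak it reports says nothing about the real data.

```r
# Hypothetical monthly crash counts for two years (illustrative only)
toy <- data.frame(
  YEAR  = rep(c(2018, 2019), each = 12),
  MONTH = rep(1:12, times = 2),
  n     = c(80, 75, 90, 95, 110, 112, 105, 100, 98, 102, 96, 92,
            82, 70, 88, 97, 108, 115, 104, 101, 99, 100, 94, 90)
)

# Mean count per calendar month, pooled across years
by_month <- aggregate(n ~ MONTH, data = toy, FUN = mean)

# Calendar month with the highest average count
peak_month <- by_month$MONTH[which.max(by_month$n)]
month.abb[peak_month]
```

Applied to the real monthly counts, the same aggregation would show whether the May/June peak and the winter low hold on average or only in particular years.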

Bar-Chart: Top 10 Contributing Factors (Deaths)

Here a bar chart displays the top ten contributing factors in fatal accidents. To determine the root of the most fatal accidents in the dataset, we look to the contributing factor column. Although the factors follow a clear ranking, they are fairly evenly distributed, and no single factor dominates. It is also important to look deeper into these results, since many of the causes are preventable and so are not truly “accidents.” Preventable or foreseeable causes include Unsafe Speed, Alcohol Involvement, and Driver Inexperience. Possible measures include, but are not limited to, more speed cameras and larger fines, breathalyzer-activated ignitions, and a more rigorous licensing process.

plot3 <- ggplot(Factors_df, aes(x = reorder(CONTRIBUTING.FACTOR.VEHICLE.1, -count), y=count, fill=CONTRIBUTING.FACTOR.VEHICLE.1)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle('Top 10 Contributing Factors to Deaths') +
  scale_fill_brewer(palette = "Spectral") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Contributing Factor") +
  ylab("Fatal Crash Count")  # count = n() counts crashes with at least one death, not individual deaths
plot3
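
The setup for this chart counts rows, so each bar is the number of crashes with at least one death rather than the number of people killed. The two measures diverge whenever a crash has multiple fatalities. The sketch below shows the difference in base R. The data frame `toy` and its columns `factor_1` and `killed` are hypothetical stand-ins, not the dataset's columns.

```r
# Hypothetical fatal crashes, one row per crash (illustrative only)
toy <- data.frame(
  factor_1 = c("Unsafe Speed", "Unsafe Speed", "Alcohol Involvement"),
  killed   = c(1, 3, 1)
)

# Number of fatal crashes per factor (rows per group)
fatal_crashes <- tapply(toy$killed, toy$factor_1, length)

# Number of deaths per factor (sum of persons killed)
deaths <- tapply(toy$killed, toy$factor_1, sum)

fatal_crashes["Unsafe Speed"]  # 2 fatal crashes
deaths["Unsafe Speed"]         # 4 deaths
```

Which measure is more appropriate depends on the question. Crash counts describe how often a factor leads to a fatal collision, while death totals describe its human cost.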

Horizontal Stacked-Bar: Injury Crashes by Factor and Borough

Following up on the previous bar chart, this horizontal format takes a closer look at the individual boroughs. A major purpose of this analysis is to see what is happening within each borough of NYC, so it is important to check whether the top factors noted in the last graph are specific to particular boroughs. The chart shows that these factors are not unique to any one borough, though the boroughs differ slightly in their proportions. More specifically, Manhattan has a fairly even spread across factors, while other boroughs do not: Brooklyn, for example, sees a near 100% jump in accidents related to “Traffic Control Disregarded” compared with other factors.

plot5 <- ggplot(stack_df, aes(x = reorder(CONTRIBUTING.FACTOR.VEHICLE.1, +n), y = n, fill = BOROUGH)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  labs(title = "Injury Crashes by Factor and Borough", x = "Contributing Factors", y = "Injury Crash Count") +
  scale_y_continuous(labels = comma) 
plot5
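
Because boroughs differ in total volume, comparing their factor mix is easier with within-borough proportions than with raw counts. The following base R sketch shows one way to compute them. The data frame `toy`, its column `FACTOR`, and its counts are invented for illustration.

```r
# Hypothetical injury-crash counts by factor and borough (illustrative only)
toy <- data.frame(
  FACTOR  = rep(c("Unsafe Speed", "Traffic Control Disregarded"), times = 2),
  BOROUGH = rep(c("BROOKLYN", "MANHATTAN"), each = 2),
  n       = c(300, 600, 100, 100)
)

# Cross-tabulate counts: rows are factors, columns are boroughs
counts <- xtabs(n ~ FACTOR + BOROUGH, data = toy)

# Convert each column to proportions so boroughs are compared on mix, not volume
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Each column of `shares` sums to 1, so an even spread (as in the toy Manhattan column) and a skewed one (as in the toy Brooklyn column) can be compared directly.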

Map Plot: Locations of Crashes by Injury Amount

Here a map shows the locations of crashes in June 2019. This month was chosen to limit the number of points, and it is also a peak in crashes (plot2). The points are placed using latitude and longitude coordinates and are color- and size-coded by the size of the crash, measured by persons injured. Manhattan has the largest concentration of small crashes (1 injury), shown by the dark mass of small points in lower Manhattan. Queens and Brooklyn have the highest number of larger crashes. This is interesting but not necessarily surprising, and may be related to the average speed limits in these boroughs. Looking more closely at where some of the largest crashes occurred, it is common to see a large intersection with an extremely high concentration of crashes at nearly matching coordinates.

m <- leaflet() %>%
  addTiles() %>%
  addCircles(lng = Map_df1$LONGITUDE, lat = Map_df1$LATITUDE, opacity = .63, color = "black", label = "Injuries: 1") %>%
  addCircles(lng = Map_df2$LONGITUDE, lat = Map_df2$LATITUDE, opacity = .75, color = "dodgerblue", radius = 100, label = "Injuries: 2") %>%
  addCircles(lng = Map_df3$LONGITUDE, lat = Map_df3$LATITUDE, opacity = .75, color = "chartreuse", radius = 200, label = "Injuries: 3") %>%
  addCircles(lng = Map_df4$LONGITUDE, lat = Map_df4$LATITUDE, opacity = .75, color = "darkorange", radius = 400, label = "Injuries: 4-6") %>%
  addCircles(lng = Map_df5$LONGITUDE, lat = Map_df5$LATITUDE, opacity = .75, color = "red", radius = 650, label = "Injuries: 7+")
m
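
The clusters of nearly matching coordinates could also be found without the map by rounding latitude and longitude and counting crashes per rounded location. This is a rough base R sketch, assuming that three decimal places (on the order of 100 m) is a reasonable grid size. The coordinates in `toy` are made up and the `cell` column is a hypothetical helper.

```r
# Hypothetical crash coordinates (illustrative only, not from the dataset)
toy <- data.frame(
  LATITUDE  = c(40.75801, 40.75803, 40.75799, 40.67820, 40.71280),
  LONGITUDE = c(-73.98552, -73.98561, -73.98558, -73.94420, -74.00600)
)

# Round to 3 decimal places so near-identical points share one key
toy$cell <- paste(round(toy$LATITUDE, 3), round(toy$LONGITUDE, 3), sep = ", ")

# Count crashes per rounded location, most frequent first
hotspots <- sort(table(toy$cell), decreasing = TRUE)
head(hotspots, 3)
```

Rounding is a crude grouping: two crashes on opposite sides of a grid boundary land in different cells, so a proper spatial clustering method may be preferable for a closer study.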

Summary

Overall, Brooklyn has the highest number of crashes in various categories, although the distribution of crashes across boroughs may have less to do with flawed traffic control or unsafe roads and more to do with heavier traffic and higher speed limits. Over time, the boroughs are highly correlated (with the exception of Staten Island) and follow a cyclic pattern in crashes by month, where, surprisingly, the winter months have the fewest crashes. The most deadly crashes are often the result of unsafe speed, driver distraction, and failure to yield the right of way, all of which drivers can prevent through safe driving practices. On the interactive map, groupings of crashes (coded by size and danger level) are easy to identify, with points clustered at large intersections and on larger roads in general. In conclusion, these visualizations present a well-rounded picture of where and why accidents have occurred in NYC over the last few years.