New York City Traffic Accident Dataset

Dataset Background

This dataset was downloaded from NYC Open Data and contains over 2 million rows of motor vehicle collision information beginning in 2012 through to when it was downloaded to use on January 26, 2024.

One important point about this data: it only includes accidents that caused over $1,000 in damages and were reported to authorities and documented.

Columns Used:

Crash Date
Crash Time
Location of Crash - Latitude, Longitude, Borough, On-street name
Causes of Crash
Injury and Fatality Rates

Findings

Manhattan experiences the highest per capita rate of accidents.
The primary culprit behind over 70% of accidents are Reckless Driving or Distracted Driving.
Accidents are more prevalent in the summer and early fall months and Fridays tend to be the day with the greatest number of injuries and fatalities.
Accident counts saw a significant decrease following the COVID-19 shutdown and have yet to return to their pre-COVID levels.

# import libraries
library(data.table)
library(DescTools)
library(ggplot2)
library(lubridate)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plyr)
library(dplyr)
library(ggrepel)
library(stringr)
library(leaflet)
library(tidyr)
library(plotly)

file <- "U:/Motor_Vehicle_Collisions.csv"
df = fread(file)

NYC Traffic Accident Exploration

# Filter out rows with missing values in relevant columns
df <- df[!(is.na(df$`NUMBER OF PERSONS INJURED`) | is.na(df$`NUMBER OF PERSONS KILLED`)), ]
df <- df %>%
  rename(crash_date = `CRASH DATE`, STREET = `ON STREET NAME`)

# Extract relevant date and time information
x <- mdy(df$crash_date)
df$day <- weekdays(x, abbreviate = TRUE)
df$date <- day(x)
df$month <- month(x, label = TRUE, abbr = TRUE)
df$year <- year(x)
df$time <-  as.POSIXct(df$`CRASH TIME`, format="%H:%M")
df$hour <- hour(df$time)
df$day_or_night <- ifelse(df$hour <= 8 | df$hour >= 20, "night", "day")

Accident Hotspots

Top Accident Hotspots: Mapping High-Concentration Locations (2012-2024)

# Identify top locations based on accidents count
top_locations <- df[!(is.na(df$LATITUDE) | is.na(df$LONGITUDE)), ] %>%
  select(BOROUGH, LATITUDE, LONGITUDE, STREET) %>%
  mutate(LATITUDE = round(LATITUDE, 2),
         LONGITUDE = round(LONGITUDE, 2)) %>%
  group_by(LATITUDE, LONGITUDE, BOROUGH, STREET) %>%
  tally() %>%
  arrange(-n) %>%
  head(100) %>%
  data.frame()

# Aggregate data by latitude and longitude, summing up accidents
aggregated_locations <- top_locations %>%
  group_by(LATITUDE, LONGITUDE) %>%
  summarise(NumAccidents = sum(n),
            BOROUGH = first(BOROUGH),
            STREET = first(STREET))

# Create a leaflet map with red circles representing accident locations
leaflet(top_locations) %>%
  addTiles() %>%
  addCircles(
    lng = ~LONGITUDE,
    lat = ~LATITUDE,
    radius = ~sqrt(n) * 5,
    color = "red",
    fillOpacity = 0.7,
    popup = ~paste("Borough:", str_to_title(BOROUGH),
                   "<br>Accidents:", comma(n),
                   street =ifelse((STREET==""), "", paste("<br>Street Name:",
                                                          str_to_title(STREET))))
  )

Explanation of Visualization: This is an interactive map that visually represents the top 100 locations with the highest concentration of accidents between 2012 and 2024. The size of the red circles corresponds to the number of accidents at each location.

Data Preparation: When clicking the red circles, you’ll find detailed information, including the Borough and Accident Count for each location. However, the inclusion of Street names is not always seen, as this information isn’t consistently recorded in the dataset.

Key Observations: Based on this map, Lower Manhattan has a large concentration of accidents, especially around Times Square, Broadway, and other popular tourist destinations.

Accident Count Over the Years

# Create a data frame for counting accidents by year and month
year_df <- df %>%
  select(year, month) %>%
  group_by(year, month) %>%
  filter(year != 2024) %>%
  tally() %>%
  data.frame() %>%
  mutate(year = factor(year))

# Create a ggplot bar chart for total accidents by month and year
ggplot(year_df, aes(x=month, y = n, fill = year)) +
  geom_bar(stat="identity", position="dodge") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none") +
  scale_y_continuous(labels = comma) +
  labs(title = "Multiple Bar Charts - Total Accidents by Month by Year",
       x= "Months",
       y = "Accident Count",
       fill = "Year") +
  scale_fill_brewer(palette = "Paired") +
  facet_wrap(~year, ncol=4, nrow=3)

Explanation of Visualization: This visualization illustrates the total number of accidents that occurred each month over the years 2012-2023 using grouped bar charts. The x-axis represents the months and the y-axis represents the total accident count.

Data Preparation: This only includes 2012-2023. Since 2024 was not a complete year I chose to use only the 12 years since that also created a 4x3 trellis.

Key Observations: There is a slight increase in accident totals in the middle of the year and the overall decrease in total accident numbers beginning April 2020.

Also note that 2012 is not a complete year - it’s data begins in July. This could also have an effect on subsequent charts where aggregated accident counts are shown and could cause the first 6 months of the year to seem lower than in actuality.

By Borough

df_borough <- df %>%
  select(BOROUGH, month) %>%
  filter(BOROUGH != "") %>%
  group_by(month, BOROUGH) %>%
  tally() %>%
  data.frame()

# Add column to borough data frame to control for population in the boroughs
borough_pop <- c(1379946, 2590516, 1596273, 2278029, 491133)
df_borough$population <- borough_pop
df_borough$BOROUGH <- str_to_title(df_borough$BOROUGH)

# Plot accidents per capita by month and borough
ggplot(df_borough, aes(x=month, y=n/population *100, group=BOROUGH)) +
  geom_line(aes(color=BOROUGH), linewidth=4) +
  labs(title="Accidents Per Capita by Month and Borough", x="Months", y="Accidents Per Capita") +
  theme_light() +
  geom_point(shape=21, size=5, color="black", fill="white") +
  scale_y_continuous(labels=comma) +
  scale_color_brewer(palette = "Set1", name = "Borough") +
  theme(plot.title = element_text(hjust=0.5))

Explanation of Visualization: This visualization shoes the variation in accident rates per capita across the 5 NYC boroughs over 12 months, including data from all available years.

Data Preparation: When initially creating this graph, I did not control for population and was surprised by the initial results which had Brooklyn and Queens with more accidents than Manhattan. The controlling by population allows for a more insightful comparison between boroughs.

Key Observations: This line graph follows a similar trend as the grouped bar graphs. February and April are the months with the fewest accidents in all boroughs and the number of accidents increases throughout the summer with the highest rate occurring in October for 3 of the 5 boroughs.

Contributing Factors

# Create a factor count data frame
factorcount = data.frame(count(df, `CONTRIBUTING FACTOR VEHICLE 1`))
factorcount = factorcount[order(factorcount$n, decreasing = TRUE),]
factorcount <- factorcount %>%
  rename(CONTRIBUTING_FACTOR = `CONTRIBUTING.FACTOR.VEHICLE.1`)

# Combine and categorize contributing factors
combinedfactors <- factorcount %>%
  mutate(CONTRIBUTING_FACTOR = case_when(
    CONTRIBUTING_FACTOR %in% c("Unspecified", "") ~ "Unspecified",
    CONTRIBUTING_FACTOR %in% c("1", "80", "Other Vehicular", "Vehicle Vandalism", "Driverless/Runaway Vehicle") ~ "Other",
    CONTRIBUTING_FACTOR %in% c("Brakes Defective", "Steering Failure", "Tire Failure/Inadequate",
                               "Accelerator Defective", "Tow Hitch Defective", "Other Lighting Defects",
                               "Tinted Windows", "Headlights Defective", "Windshield Inadequate", "Oversized Vehicle") ~ "Vehicle Defective",
    CONTRIBUTING_FACTOR %in% c("Drugs (Illegal)", "Drugs (illegal)", "Prescription Medication",
                               "Alcohol Involvement") ~ "Under the Influence",
    CONTRIBUTING_FACTOR %in% c("Fatigued/Drowsy", "Lost Consciousness", "Physical Disability",
                               "Fell Asleep", "Illness", "Illnes") ~ "Medical Issue",
    CONTRIBUTING_FACTOR %in% c("Listening/Using Headphones", "Texting", "Cell Phone (hand-held)",
                               "Using On Board Navigation Device", "Cell Phone (hands-free)",
                               "Cell Phone (hand-Held)", "Other Electronic Device","Eating or Drinking", "Outside Car Distraction",
                               "Passenger Distraction") ~ "Driver Inattention/Distraction",
    CONTRIBUTING_FACTOR %in% c("Pavement Defective", "Traffic Control Device Improper/Non-Working",
                               "Pavement Slippery", "Obstruction/Debris", "Lane Marking Improper/Inadequate",
                               "Shoulders Defective/Improper", "Glare", "View Obstructed/Limited") ~ "Environmental Factors",
    CONTRIBUTING_FACTOR %in% c("Traffic Control Disregarded", "Failure to Yield Right-of-Way", "Backing Unsafely", "Passing Too Closely",
                               "Aggressive Driving/Road Rage", "Passing or Lane Usage Improper",
                               "Turning Improperly", "Unsafe Lane Changing", "Unsafe Speed",
                               "Failure to Keep Right", "Following Too Closely") ~ "Reckless Driver Behavior",
    CONTRIBUTING_FACTOR %in% c("Pedestrian/Bicyclist/Other Pedestrian Error/Confusion", "Animals Action", 
                               "Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle") ~ "Reactions to Environment",
    TRUE ~ as.character(CONTRIBUTING_FACTOR)
    ))

# Create data frame for overall contributing factors
df_combined <- combinedfactors %>%
  group_by(CONTRIBUTING_FACTOR) %>%
  summarize(n = sum(n))%>%
  filter(CONTRIBUTING_FACTOR != "Unspecified") %>%
  arrange(desc(n)) %>%
  mutate(percentage = round(100*n/sum(n),1)) %>%
  data.frame()

# Create data frames for reckless and distraction contributing factors
reckless_data <- factorcount %>%
  select(CONTRIBUTING_FACTOR, n) %>%
  filter(CONTRIBUTING_FACTOR %in% c("Traffic Control Disregarded", "Failure to Yield Right-of-Way", "Backing Unsafely", "Passing Too Closely",
                                    "Aggressive Driving/Road Rage", "Passing or Lane Usage Improper",
                                    "Turning Improperly", "Unsafe Lane Changing", "Unsafe Speed",
                                    "Failure to Keep Right", "Following Too Closely")) %>%
  group_by(CONTRIBUTING_FACTOR) %>%
  summarize(n = sum(n))%>%
  arrange(desc(n)) %>%
  mutate(reck_percentage = round(100*n/sum(n),1)) %>%
  data.frame()

distraction_data <- factorcount %>%
  select(CONTRIBUTING_FACTOR, n) %>%
  filter(CONTRIBUTING_FACTOR %in% c("Listening/Using Headphones", "Texting", "Cell Phone (hand-held)",
                                    "Using On Board Navigation Device", "Cell Phone (hands-free)",
                                    "Cell Phone (hand-Held)", "Other Electronic Device","Eating or Drinking", "Outside Car Distraction",
                                    "Passenger Distraction")) %>%
  group_by(CONTRIBUTING_FACTOR) %>%
  summarize(n = sum(n))%>%
  arrange(desc(n)) %>%
  mutate(reck_percentage = round(100*n/sum(n),1)) %>%
  data.frame()

# Create a pie chart for specified contributing factors
plot_ly() %>%
  add_pie(data = distraction_data, textposition = "inside", labels = ~CONTRIBUTING_FACTOR, values = ~n, 
          name = "Distraction",marker = list(colors = RColorBrewer::brewer.pal(9, "Blues")), hole = 0.6,
          domain = list(row = 1, column = 0)) %>%
  add_pie(data = reckless_data, textposition = "inside", labels = ~CONTRIBUTING_FACTOR, values = ~n, 
          name = "Reckless", marker = list(colors = RColorBrewer::brewer.pal(9, "Blues")), hole = 0.6,
          domain = list(row = 1, column = 2)) %>%
  add_pie(data = df_combined, textposition = "inside", labels = ~CONTRIBUTING_FACTOR, values = ~n, 
          name = "All Factors", marker = list(colors = RColorBrewer::brewer.pal(12, "Paired")), hole = 0.6,
          domain = list(row = 1, column = 1)) %>%
  layout(title = "Distribution of Specified Contributing Factors to Vehicle Accidents",
         grid = list(rows = 1, columns = 3),
         showlegend = FALSE,
         annotations= list(
           list(y=0.50, x=0.4952, text="All Factors", showarrow=FALSE, font=list(size=15)),
           list(y=0.50, x=0.085, text="Distracted Driver<br>Factors", showarrow=FALSE, font=list(size=15)),
           list(y=0.50, x=0.915, text="Reckless Driver<br>Factors", showarrow=FALSE, font=list(size=15))))

Explanation of Visualization: This trellis donut chart provides a view of the distribution of contributing factors to vehicle accidents. The center chart shows the total count of contributing factors and the left and right charts break down the two most common factors - reckless driver behavior and distraction-related factors.

Data Preparation: The dataset allows for the input of 5 contributing factors per accident, I only focused on Contributing Factor 1 as it came with the most filled in fields. Additionally, over one third of accidents in the dataset did not have a specified contributing factor, leading to their exclusion.

Originally, there were nearly 70 distinct contributing factors. I chose to categorize them into broader themes (for example: combining “Drugs (Illegal)”, “Drugs (illegal)”, “Prescription Medication”, “Alcohol Involvement” into one category of “Under the Influence”).

Key Observations: The key contributing factors in accidents were “Distracted Driver Factors” (32%) and “Reckless Driver Factors” (43.4%). Almost all distracted or inattentive driving factors were due to outside car distraction or passenger distraction.

Injuries

# Create a data frame for injury and death counts
injuries_df <- df %>%
  filter(year != 2012 & year != 2024) %>%
  select(year, month, day, `NUMBER OF PERSONS KILLED`, `NUMBER OF PERSONS INJURED`) %>%
  group_by(month, day) %>%
  summarise(across(starts_with("NUMBER OF"), ~ sum(.>= 1))) %>%
  mutate(total = `NUMBER OF PERSONS KILLED` + `NUMBER OF PERSONS INJURED`) %>%
  data.frame()

# Define the order of days
injuries_df$day <- factor(injuries_df$day, levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
breaks <- c(seq(0,max(injuries_df$total), by=500))

# Create a ggplot heatmap for injuries by day of the week and month
g <- ggplot(injuries_df, aes(x=month, y=day, fill=total)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(total))) +
  coord_equal(ratio=1) +
  labs(title="Heatmap: Injuries in Accidents by Day of the Week and Month",
       x = "Month",
       y = "Day of the Week",
       fill = "Accident Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_discrete(limits = rev(levels(injuries_df$day))) +
  scale_fill_continuous(low="white", high="indianred", breaks = breaks) +
  guides(fill = guide_legend(reverse=TRUE, override.aes = list(color="black")))

# Display the interactive ggplot heatmap
ggplotly(g, tooltip = c("total", ~comma(total), "month", "day")) %>%
  style(hoverlabel = list(bgcolor = "white"))

Explanation of Visualization: This heat map provides a look into the number of injuries occurring in accidents, specifically looking for patterns based on the day of the week and the month. Each cell of the heatmap represents total injuries, with the color intensity conveying the magnitude.

Data Preparation: This is another graph that could be affected by the missing data in 2012 so I filtered it out along with 2024. There were significantly more injuries than fatalities (thankfully) in the data so those were combined into 1 total.

Key Observations: The greatest number of injuries in the dataset occur on Fridays - especially between the months of May and October. The least number of injuries occur on Sundays but the months of January through April have overall much lower rates.

COVID Effects

# Create a data frame for counting accidents in 2020 by date and time of day
COVID_df <- df %>%
  select(crash_date, year, day_or_night) %>%
  filter(year == 2020) %>%
  group_by(crash_date, day_or_night) %>%
  tally() %>%
  data.frame() %>%
  mutate(crash_date = as.Date(crash_date, format = "%m/%d/%Y")) %>%
  na.omit()

# Create a data frame for counting accidents in 2019 by date and time of day
preCOVID_df <- df %>%
  select(crash_date, year, day_or_night) %>%
  filter(year == 2019) %>%
  group_by(crash_date, day_or_night) %>%
  tally() %>%
  data.frame() %>%
  mutate(crash_date = as.Date(crash_date, format = "%m/%d/%Y")) %>%
  na.omit()

# Combine the data frames for 2019 and 2020 with an additional 'period' column
combined_COVID_df <- bind_rows(
  mutate(preCOVID_df, period = "2019"),
  mutate(COVID_df, period = "2020"))

# Plot side-by-side area charts
ggplot(combined_COVID_df, aes(x = crash_date, y = n, fill = day_or_night)) +
  geom_area() +
  theme_light() +
  facet_grid(~ period, scales = "free_x") +
  scale_fill_manual(values = c("night" = "darkblue", "day" = "darkgoldenrod1"),
                    labels=c("Day - 8am to 8pm", "Night - 8pm to 8am")) +
  labs(title = "Area Charts of Accident Counts in 2019 & 2020 by Time of Day",
       x = "Accident Date",
       y = "Accident Count",
       fill = "Time of Day") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "top")

Explanation of Visualization: These side-by-side area charts show accident counts during the years 2019 and 2020. The yellow represents accidents that took place in the day time (8:00am to 8:00pm) and the dark blue represents accidents during the night time (8:00pm to 8:00am).

Data Preparation: In the original dataset, a new column was created to differentiate between accidents that took place during the day or at night. I initially used this to look at just day/night break down in a stacked bar chart over the days of the week. After spending more time with the data, I realized it may be more interesting to look at the specific drop in accidents during and after the COVID-19 shut down.

Key Observations: The juxtaposition of 2019 and 2020 show an interesting trend of accidents going down significantly in late March of 2020 and remaining lower than the 2019 levels for the rest of the year. What I find interesting about this graph is not the steep decline in accidents in 2020, but the fact that those levels have endured to the present.

Wrap up

Summary

Several patterns emerged across the visualizations:

Accident Hotspots: The interactive map pinpointing top accident locations between 2012 and 2024 highlighted concentrations in Lower Manhattan, particularly around popular tourist destinations like Times Square and Broadway.
Accident Count Over the Years: The grouped bar charts showcased monthly accident trends from 2012 to 2023. Notably, a slight mid-year increase and a significant overall decline in total accidents since April 2020 were observed.
Accidents Per Capita by Borough: The line graph, adjusted for borough population, emphasized variations in accident rates across the five NYC boroughs. It echoed trends seen in the grouped bar graphs, with peak accident rates in October for three out of five boroughs.
Contributing Factors: The trellis donut chart provided a concise breakdown of contributing factors. Reckless driving (43.4%) and distracted driving (32%) were prominent, with specific attention to factors like outside car distraction and passenger distraction.
Injuries: The heatmap revealed patterns in injuries by day of the week and month. Fridays showed the highest injury rates, particularly from May to October, while Sundays exhibited lower rates, especially in the months of January through April.
COVID Effects: Side-by-side area charts depicted a substantial decline in accidents during 2020, particularly evident after the COVID-19 shutdown. This reduction persisted below 2019 levels for the remainder of the year, presenting an intriguing aspect of the dataset.

Additional Research

Here are some additional questions that I came up with while working with the data. Some I was not able to get to, due to time restraints or the quality of the data, others were not available to look at in the data given.

Additional questions that could be asked using this dataset:

What kind of vehicles are most commonly involved in accidents?
Have certain types of accidents become more or less common after the COVID-19 shutdown?
Are there specific trends in types of accidents that lead to more fatalities?

Additional questions that could not be answered using this dataset:

What are the reasons behind increased accidents during the summer and fall - is it due to increased tourism or more movement by locals or another factor?
What is the demographic profile of the drivers involved in accidents?
How does the current traffic infrastructure and road design contribute to or help mitigate accidents?

Conclusions

Completing this exploration of New York City’s traffic data has offered a glimpse into the geographical, human, and environmental factors that play a role in everyday traffic accidents.