CUNY DATA 608
Question
NYC is one of the largest and busiest cities in the world, and accidents there are a common occurrence: according to the NYPD, there are about 678 car accidents a day. This research study investigates accidents in Staten Island (SI), one of New York City's boroughs. The focus is on the leading causes of accidents, the types of vehicles involved, and the “hot spots”, the areas most prone to accidents.
Libraries
library(tidyverse)
library(plotly)
library(readr)
library(knitr)
library(leaflet)
library(tigris)
library(httr)
library(leaflet.extras)
Data
The data is taken from the NYPD Motor Vehicle Collisions dataset on NYC Open Data.
df <- read_csv("https://raw.githubusercontent.com/mandiemannz/Data-608/master/NYPD_Motor_Vehicle_Collisions%202017.csv")
kable(head(df))
| DATE | TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | CONTRIBUTING FACTOR VEHICLE 1 | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | UNIQUE KEY | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 03/20/2017 | 07:00:00 | STATEN ISLAND | 10306 | 40.57046 | -74.10977 | (40.570465, -74.10977) | HYLAN BOULEVARD | NEW DORP LANE | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Following Too Closely | Unspecified | NA | NA | NA | 3635600 | SPORT UTILITY / STATION WAGON | PASSENGER VEHICLE | NA | NA | NA |
| 03/20/2017 | 08:00:00 | STATEN ISLAND | 10309 | 40.54909 | -74.22084 | (40.54909, -74.22084) | VETERANS ROAD WEST | BLOOMINGDALE ROAD | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Turning Improperly | Unspecified | NA | NA | NA | 3635743 | PASSENGER VEHICLE | NA | NA | NA | NA |
| 03/20/2017 | 12:28:00 | STATEN ISLAND | 10309 | 40.53793 | -74.21623 | (40.537933, -74.21623) | NA | NA | 53 MARISA CIRCLE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Driver Inattention/Distraction | Unspecified | NA | NA | NA | 3635748 | SPORT UTILITY / STATION WAGON | NA | NA | NA | NA |
| 03/20/2017 | 12:55:00 | STATEN ISLAND | 10312 | 40.53536 | -74.15594 | (40.535355, -74.15594) | KING STREET | RICHMOND AVENUE | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Unspecified | Unspecified | NA | NA | NA | 3635794 | SPORT UTILITY / STATION WAGON | NA | NA | NA | NA |
| 03/20/2017 | 09:50:00 | STATEN ISLAND | 10312 | 40.56041 | -74.16975 | (40.56041, -74.16975) | NA | NA | 3229 RICHMOND AVENUE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Unspecified | Unspecified | NA | NA | NA | 3635796 | PASSENGER VEHICLE | NA | NA | NA | NA |
| 03/20/2017 | 15:10:00 | STATEN ISLAND | 10301 | NA | NA | NA | NA | NA | 2 ST. PAULS AVENUE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Other Vehicular | Unspecified | NA | NA | NA | 3635845 | PICK-UP TRUCK | SPORT UTILITY / STATION WAGON | SPORT UTILITY / STATION WAGON | NA | NA |
Contributing Factors
To see which factors contributed to accidents, the records are first grouped by the contributing factor recorded for the first vehicle. Many factors occur only a handful of times, so the data is filtered to keep only factors with more than 50 occurrences.
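# Keep only rows whose contributing factor appears more than 50 times
df1 <- df %>%
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
  filter(n() > 50)
The data is then plotted with ggplot2 and converted to an interactive plotly graph.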
# Horizontal bar chart of accident counts per factor, coloured by vehicle type
p <- ggplot(df1,
            aes(`CONTRIBUTING FACTOR VEHICLE 1`)) +
  geom_bar(aes(fill = `VEHICLE TYPE CODE 1`)) +
  coord_flip() +
  theme(legend.position = "none") +
  xlab("Contributing Factor")
p <- ggplotly(p)
p
Looking at the graph, Driver Inattention/Distraction has the highest number of occurrences in this dataset, followed by Failure to Yield Right-of-Way and Following Too Closely. Unspecified also has a high count, but it does not indicate what the actual contributing factor was. One might assume that distracted drivers are busy on their cell phones, but the Driver Inattention/Distraction label itself does not say what caused the distraction.
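The ranking read off the bar chart can also be checked as a plain table. The following is a minimal sketch, not the author's code: `toy` is a hypothetical stand-in for `df` with made-up rows, and excluding the "Unspecified" label is an assumption about what a reader would want to compare.

```r
library(dplyr)
library(tibble)

# Hypothetical rows standing in for the collisions data (illustration only)
toy <- tibble(
  `CONTRIBUTING FACTOR VEHICLE 1` = c(
    "Driver Inattention/Distraction", "Unspecified",
    "Driver Inattention/Distraction", "Following Too Closely",
    "Unspecified", "Driver Inattention/Distraction"
  )
)

# Drop the uninformative "Unspecified" label (this also drops NA values),
# then count accidents per factor, most frequent first
factor_counts <- toy %>%
  filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified") %>%
  count(`CONTRIBUTING FACTOR VEHICLE 1`, sort = TRUE, name = "collisions")

print(factor_counts)
```

Replacing `toy` with `df` would apply the same tabulation to the full dataset.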
The next investigation is of the vehicle types themselves. This variable also contains many vehicle types that occur only once, so for this analysis the data is filtered to types with more than 10 occurrences.
# Keep only vehicle types that appear more than 10 times
vehicle <- df %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  filter(n() > 10)
vehicles <- ggplot(vehicle, aes(`VEHICLE TYPE CODE 1`)) +
  geom_bar(aes(fill = `VEHICLE TYPE CODE 1`)) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Vehicle Type") +
  ggtitle("Count of Accidents by Vehicle Type")
vehicles <- ggplotly(vehicles)
vehicles
Looking at the graph above, the vehicle type with the most accidents is the passenger vehicle, followed by the SUV. This makes sense, since most vehicles in the data fall into those categories.
Location of Accidents
The next investigation looks at where most accidents occur; perhaps there are locations that the local authorities can investigate.
# Keep only streets with more than 50 recorded accidents
streetname <- df %>%
  group_by(`ON STREET NAME`) %>%
  filter(n() > 50)
streetnames <- ggplot(streetname, aes(`ON STREET NAME`)) +
  geom_bar(aes(fill = `VEHICLE TYPE CODE 1`)) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  xlab("Street Name") +
  ggtitle("Count of Vehicle Types vs. Street")
streetnames <- ggplotly(streetnames)
streetnames
According to the data, much of the street-name data is missing (NA). Setting the missing values aside, Hylan Boulevard and Richmond Road have the highest number of recorded accidents.
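Because missing street names form the largest group, it can help to measure how much is missing and rank only the named streets. This is a rough sketch under stated assumptions: `toy` is a hypothetical stand-in for `df` with made-up rows, and dropping NA values before counting is one possible choice rather than the approach used in this analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical rows standing in for the collisions data (illustration only)
toy <- tibble(
  `ON STREET NAME` = c("HYLAN BOULEVARD", NA, "RICHMOND ROAD",
                       "HYLAN BOULEVARD", NA, NA)
)

# Share of records with no street name recorded
share_missing <- mean(is.na(toy$`ON STREET NAME`))

# Rank streets by accident count, ignoring records with a missing name
street_counts <- toy %>%
  filter(!is.na(`ON STREET NAME`)) %>%
  count(`ON STREET NAME`, sort = TRUE)

print(share_missing)
print(street_counts)
```

Reporting the missing share alongside the ranking makes clear how much of the data the street-level comparison actually covers.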
Leaflet Map
The next step in the analysis is to plot the data points on a map, which makes it easier to see where accidents cluster. The data is filtered to include only SI; however, some latitude/longitude pairs appear to point to other areas of NYC.
# Keep the coordinates and the contributing factor, dropping rows with missing values
df2 <- subset(df, select = c("LONGITUDE", "LATITUDE", "CONTRIBUTING FACTOR VEHICLE 1"))
df2 <- na.omit(df2)
leaflet() %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(-74.15, 40.57, zoom = 11) %>%
  addHeatmap(
    lng = df2$LONGITUDE, lat = df2$LATITUDE, blur = 20, max = 0.05, radius = 15
  )
The heatmap agrees with the earlier results: most accidents seem to happen on the southern part of Hylan Boulevard.
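Coordinates that fall outside Staten Island could be screened out with a simple bounding box before mapping. The sketch below is illustrative only: the box limits in `si_box` are rough assumed values rather than an official boundary, and `pts` is a hypothetical stand-in for `df2` whose first two rows reuse coordinates from the sample table while the last two are invented.

```r
library(dplyr)
library(tibble)

# Approximate bounding box for Staten Island (assumed values, not from the dataset)
si_box <- list(lat_min = 40.49, lat_max = 40.65,
               lng_min = -74.26, lng_max = -74.05)

# Hypothetical points: two inside the box, one outside it, one with missing coordinates
pts <- tibble(
  LATITUDE  = c(40.57046, 40.54909, 40.75, NA),
  LONGITUDE = c(-74.10977, -74.22084, -73.99, NA)
)

# Keep only points inside the box; rows with NA coordinates are dropped as well
si_pts <- pts %>%
  filter(
    between(LATITUDE, si_box$lat_min, si_box$lat_max),
    between(LONGITUDE, si_box$lng_min, si_box$lng_max)
  )

print(si_pts)
```

A rectangle is only a coarse filter: it can still include nearby points that are not on Staten Island, so a borough polygon would be more precise.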
Conclusion
The data shows that most accidents are attributed to driver inattention. One could assume that inattention is related to cell-phone usage, although the data does not confirm this. The data also shows that most accidents involve a passenger vehicle or an SUV and occur on busy major streets.