In this assignment, I will be looking at a dataset of motor vehicle collisions in New York City over the last three years. The dataset is hosted on the NYC Open Data website and includes all vehicle collisions recorded by the NYPD since July 2012. This is a live dataset that is updated every weekday with recent accident data. Over 50,000 accidents are recorded in the set, along with details relevant to the accidents: time, data, vehicle type, possible contributing factors, injuries, fatalities, and the precise coordinates of each incident. I believe Vision Zero is an important policy initiative to reduce traffic accidents in New York City, but I am uncertain of the efficacy of reducing speed limits. My hypothesis is that the most accidents in NYC occur at difficult intersections and conjection points, and not necessarily as a result of speeding.
To start our analysis, we will gather our data and load the necessary packages we need for our analysis. For optimal performance, we will pull the json data directly from the website and store it in a dataframe for analysis.
The Motor Vehicle Collision Data Can be accessed here:
https://nycopendata.socrata.com/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
In JSON Format:
https://data.cityofnewyork.us/resource/h9gi-nx95.json
Data Sources for Road Miles and Blocks For Each Borough:
Manhattan: http://www.fcny.org/cmgp/streets/pages/2001PDF/Report/DFMN.pdf Queens: http://www.fcny.org/cmgp/streets/pages/2001PDF/Report/DFQU.pdf Bronx: http://www.fcny.org/cmgp/streets/pages/2001PDF/Report/DFBX.pdf Staten Island: http://www.fcny.org/cmgp/streets/pages/2001PDF/Report/DFSI.pdf Brooklyn: http://www.fcny.org/cmgp/streets/pages/2001PDF/Report/DFbk.pdf
suppressWarnings(suppressMessages(library(ggmap)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(plyr)))
suppressWarnings(suppressMessages(library(RCurl)))
suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(knitr)))
This call will access the up-to-date Json data and convert it into a dataframe:
jsonURL <- getURL("https://data.cityofnewyork.us/resource/h9gi-nx95.json?$limit=50000")
accidents <- fromJSON(jsonURL,simplifyDataFrame=TRUE)
Some of the accident data will need to be converted to class numeric or factor for our analysis:
accidents$borough <- as.factor(accidents$borough)
accidents$latitude <- as.numeric(accidents$latitude)
accidents$longitude <- as.numeric(accidents$longitude)
I gathered information on the number of blocks and road miles per borough from the above reports published by the Fund for the City of New York. Since we do not have data on the number of intersections in NYC, we can infer that it will be roughly proportional to the number of blocks. Unfortunately, the information is stored in PDFs, so I had to gather the information manually.
collisions <- summary(accidents$borough)[2:6]
miles <- c(771,1742,508,2136,788)
blocks <- c(9921,21410,6718,27780,9937)
densityData <- rbind(collisions,miles,blocks)
densityData <- as.data.frame(densityData)
Now we will draw up some descriptive graphics. We can use a pie chart to visualize the total number of accidents by borough, as it is a small number of categories. Here we see that approximately 15% of the 500,000+ accidents are lacking geographical information. My hypothesis would be that the number of accidents is proportional to the density of the borough. The pie chart shows that there are more accidents in Brooklyn than anywhere else and that there are approximately as many accidents in Queens as in Manhattan.
byBorough <- summary(accidents$borough)
pie(byBorough)
Quick Facts: Total Number of Miles and Blocks in NYC
sum(miles)
## [1] 5945
sum(blocks)
## [1] 75766
Now we will examine contributing factors in accidents. The dataset has five columns of contributing factors to accidents, one for each possible vehicle involved, up to 5 different cars. We will see that there is a wide variety of contributing factors to accidents in NYC, but unsafe speed is only mentioned in 43 cases in the entire dataset.
allFactors <- rbind(accidents$contributing_factor_vehicle_1,accidents$contributing_factor_vehicle_2,accidents$contributing_factor_vehicle_3,accidents$contributing_factor_vehicle_4,accidents$contributing_factor_vehicle_5)
allFactors <- as.factor(allFactors)
allFactors <- sort(summary(allFactors),decreasing=T)
facTable <- as.table(allFactors)
# Remove the first two values, since they are NA and "Unspecified"
facTable <- facTable[-1]
facTable <- facTable[-1]
facTable <- as.data.frame(facTable)
kable(facTable, style = "markdown")
| facTable | |
|---|---|
| Driver Inattention/Distraction | 6645 |
| Failure to Yield Right-of-Way | 2517 |
| Fatigued/Drowsy | 2510 |
| Other Vehicular | 2024 |
| Backing Unsafely | 1707 |
| Lost Consciousness | 1450 |
| Turning Improperly | 1260 |
| Prescription Medication | 1232 |
| Pavement Slippery | 864 |
| Driver Inexperience | 742 |
| Physical Disability | 721 |
| Traffic Control Disregarded | 662 |
| Outside Car Distraction | 565 |
| Alcohol Involvement | 410 |
| Oversized Vehicle | 337 |
| Passenger Distraction | 278 |
| View Obstructed/Limited | 256 |
| Other Electronic Device | 189 |
| Aggressive Driving/Road Rage | 177 |
| Illness | 169 |
| Glare | 131 |
| Reaction to Other Uninvolved Vehicle | 125 |
| Failure to Keep Right | 106 |
| Brakes Defective | 100 |
| Obstruction/Debris | 100 |
| Fell Asleep | 73 |
| Pavement Defective | 73 |
| Unsafe Speed | 43 |
| Drugs (Illegal) | 40 |
| Steering Failure | 40 |
| Lane Marking Improper/Inadequate | 36 |
| Traffic Control Device Improper/Non-Working | 35 |
| Following Too Closely | 27 |
| Accelerator Defective | 23 |
| Animals Action | 23 |
| Tire Failure/Inadequate | 23 |
| Passing or Lane Usage Improper | 22 |
| Unsafe Lane Changing | 21 |
| Cell Phone (hands-free) | 14 |
| Pedestrian/Bicyclist/Other Pedestrian Error/Confusion | 9 |
| Windshield Inadequate | 7 |
| Cell Phone (hand-held) | 6 |
| Headlights Defective | 5 |
| Other Lighting Defects | 5 |
| Shoulders Defective/Improper | 4 |
| Tow Hitch Defective | 3 |
Now we will perform linear regression on both blocks and miles as the independent variables.The coefficients from the linear regression show us tha that the number of road miles in a borough has a fairly significant effect on the number of collisions that occur, but the number of blocks is fairly insignificant. In the following two plots, we will plot each variable separately against the number of collisions along with the least fit line. We will see that there is a negative correlation between the amount of road miles in a borough and the number of collisions. The plot showing blocks and collisions hardly varies from the first, as it has little effect on the number of accidents. There is also however a high degree of variability in these plots, but this is to be expected with a small number of data points.
fit <- lm(collisions ~ miles + blocks, data=densityData)
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13858.665944 3137.182561 4.417552 0.04761322
## miles 58.569570 41.621555 1.407193 0.29465486
## blocks -4.922372 3.270036 -1.505296 0.27118757
plot(blocks,collisions,xlab="Blocks",ylab="Collisions",main="Density 1 - Collision vs. Blocks")
abline(lm(collisions ~ blocks,densityData))
plot(miles,collisions,xlab="Road Miles",ylab="Collisions",main="Density 2 - Collision vs. Road Miles")
abline(lm(collisions ~ miles,densityData))
Now we will create heatmaps of accidents for each borough. We will do so by creating a new dataframe of accidents for each borough and plotting the density of accidents with the \(\textit{ggmap}\) package.
First we will subset the dataframes for each borough:
justManhattan <- subset(accidents,borough == "MANHATTAN")
justBrooklyn <- subset(accidents,borough == "BROOKLYN")
justQueens <- subset(accidents,borough == "QUEENS")
justBronx <- subset(accidents,borough == "BRONX")
justSI <- subset(accidents,borough == "STATEN ISLAND")
Now we will initialize the map objects for each:
mapManhattan <- suppressWarnings(suppressMessages(get_map(location="Manhattan", source="google", maptype="roadmap", crop=FALSE,zoom=12)))
mapBrooklyn <- suppressWarnings(suppressMessages(get_map(location="Brooklyn", source="google", maptype="roadmap", crop=FALSE,zoom=12)))
mapQueens <- suppressWarnings(suppressMessages(get_map(location="Queens", source="google", maptype="roadmap", crop=FALSE,zoom=11)))
mapBronx <- suppressWarnings(suppressMessages(get_map(location="Bronx", source="google", maptype="roadmap", crop=FALSE,zoom=12)))
mapSI <- suppressWarnings(suppressMessages(get_map(location="Staten Island", source="google", maptype="roadmap", crop=FALSE,zoom=12)))
ggmap(mapManhattan) + stat_density2d(aes(x = longitude, y = latitude,alpha=0.01), size = 0.05, bins = 10, data = justManhattan, geom = "polygon", colour = "red") + ggtitle("Manhattan - Heatmap of Accidents")
## Warning: Removed 80 rows containing non-finite values (stat_density2d).
If we look closely, we that the accidents are concentrated around all of the access points to bridges and tunnels. It should not be suprising that vehicle collisions occur more frequently at these congestion points, but it brings into question whether speed is a factor in most collisions.
ggmap(mapBrooklyn) + stat_density2d(aes(x = longitude, y = latitude,alpha=0.01), size = 0.05, bins = 10, data = justBrooklyn, geom = "polygon", colour = "red") + ggtitle("Brooklyn - Heatmap of Accidents")
## Warning: Removed 712 rows containing non-finite values (stat_density2d).
The map of Brooklyn shows a number of areas where vehicle collisions are common. There appear to be a few clusters on both the BQE (I-278) and Atlantic Avenue.
ggmap(mapQueens) + stat_density2d(aes(x = longitude, y = latitude,alpha=0.01), size = 0.05, bins = 14, data = justQueens, geom = "polygon", colour = "red") + ggtitle("Queens - Heatmap of Accidents")
ggmap(mapBronx) + stat_density2d(aes(x = longitude, y = latitude,alpha=0.01), size = 0.05, bins = 10, data = justBronx, geom = "polygon", colour = "red") + ggtitle("Bronx - Heatmap of Accidents")
ggmap(mapSI) + stat_density2d(aes(x = longitude, y = latitude,alpha=0.01), size = 0.05, bins = 10, data = justSI, geom = "polygon", colour = "red") + ggtitle("Staten Island - Heatmap of Accidents")
We can compare these heatmaps to the Vision Zero View infographic here: http://www.vzv.nyc/
The Vision Zero Plan emphasizes “Arterial Slow Zones” - lowering speed limits on specific roadways - to reduce avoidable accidents. Comparing our heatmaps with Vision Zero View, we see that some of the new speed limits will problematic roadways in Brooklyn and Queens, but not as much in the other boroughs. We see a high volume of accidents in the Arterial Slow Zones on Queens Boulevard (Queens) and Atlantic Ave and Ocean Parkway south of Prospect Park (Brooklyn). The new zones, however, do not cover problem areas in the Bronx (roadways near the Harlem River), Staten Island (Hyland Boulevard and the entrance to the Verrazona-Narrows Bridge) or Manhattan (all access points). In fact, the slow zones on Broadway and Amsterdam on the Upper West Side appear on our heatmap to be in a low-accident area.