Problem Statement: What NYC Boro & Precinct are the most dangerous for Motor Vehicles and Bicyclists for June 2022?

On the NYC NYPD site, https://www1.nyc.gov/site/nypd/stats/traffic-data/traffic-data-collision.page, I located Excel files containing traffic collision data parsed out by month. In its current excel format, I needed to clean up the data and then recreate it as a CSV for use in the final project. I selected the Traffic Collisions data for June 2022.

Excel File Cleanup

The clean up of the Excel file began with getting the file and reading it into a data frame.

#Get the excel file from the NYC NYPD Website
url1 <- 'https://www1.nyc.gov/assets/nypd/downloads/excel/traffic_data/cityacc-en-us.xlsx'
GET(url1, write_disk(tf <- tempfile(fileext = ".xlxs")))
df_NYCCollisions <- read_excel(tf, sheet = "GeoCollisions_1")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
# Read sheet 1 of the multi-sheet excel file and assigned it to df with new column names.
colnames(df_NYCCollisions) <- c("GeoCode", "GeoCodeLabel", "Number_of_Motor_Vehicle_Collisions", "Vehicles_or_Motorists_Involved", "Injury_or_Fatal_Collisions", "MotoristsInjured", "MotoristsKilled", "PassengInjured","PassengKilled", "CyclistsInjured", "CyclistsKilled", "PedestrInjured", "PedestrKilled", "Bicycle")

Data Wrangling - Excel file date types, column and coversion to CSV

Once the excel data was read into a data frame, further adjustments to the data were needed before a usable CSV file could be generated. I needed the GeoCode to be an integer to be able to insert values into a new column based on the GeoCode values. I wanted to preserve the NYC Boro name values that were presented as summary totals and not as discreet rows.

# Converts the GeoCode colum from char to int
df_NYCCollisions$GeoCode <- as.integer(df_NYCCollisions$GeoCode)

#Creates the "Boro" column and populates the boro name based on the precinct number in the GeoCode column.
df_NYCCollisions <- df_NYCCollisions %>% mutate(Boro =
                                                  case_when(GeoCode <= 34 ~ 'Manhattan',
                                                            GeoCode <= 52  ~ 'Bronx',
                                                            GeoCode <= 94 ~ 'Brooklyn',
                                                            GeoCode <= 115  ~ 'Queens',
                                                            GeoCode <= 123 ~ 'Staten Island'))

#Removes all the rows with NA value. This resulted in a clean data frame that could be written to a CSV file.
df_NYCCollisions <- df_NYCCollisions[is.na(df_NYCCollisions$GeoCode) == 0, c("GeoCode", "GeoCodeLabel", "Number_of_Motor_Vehicle_Collisions", "Vehicles_or_Motorists_Involved", "Injury_or_Fatal_Collisions", "MotoristsInjured", "MotoristsKilled", "PassengInjured","PassengKilled", "CyclistsInjured", "CyclistsKilled", "PedestrInjured", "PedestrKilled", "Bicycle", "Boro")]

#The data frame is exported to a local CSV file that was copied to Github.  The rest of the project reads the CSV from the Github repo
write.csv(df_NYCCollisions, "/Users/johnnyrodriguez/Documents/MSDS/NYC_Collisions_June2022.csv", row.names=FALSE)

Data Exploration

The data is read from the CSV located in the github repo from the the file generated above. From here, the summary data by boro begins to show the range of values for motor vehicle collisions and bicycle injuries fatalities.

For June 2022, Brooklyn has both the most overall traffic collisions as well as most collisions involving bicycles. While Staten Island does not have the lowest number of Collision (beaten by Manhattan), it has significantly less bicycle-involved collisions than all other boros.

#Read the CSV from the github repo
Collisions_Jun2022 <- read.csv('https://raw.githubusercontent.com/johnnymango/rbridge/main/NYC_Collisions_June2022.csv')

#Summarize Motorized Injured by Boro
tapply(Collisions_Jun2022$Number_of_Motor_Vehicle_Collisions, Collisions_Jun2022$Boro, summary) 
## $Bronx
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    88.0   104.5   125.5   130.2   150.0   184.0 
## 
## $Brooklyn
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    74.0   102.5   124.0   131.2   144.0   303.0 
## 
## $Manhattan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   54.50   73.00   72.32   89.00  140.00 
## 
## $Queens
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    38.0   122.8   165.5   153.8   181.2   265.0 
## 
## $`Staten Island`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    72.0   106.5   125.5   119.5   138.5   155.0
#Summarize Accidents involving Bicycles by Boro
tapply(Collisions_Jun2022$Bicycle, Collisions_Jun2022$Boro, summary) 
## $Bronx
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    5.00    6.00    6.75   12.00 
## 
## $Brooklyn
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    6.00   11.00   11.22   14.50   26.00 
## 
## $Manhattan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   8.000   9.545  11.750  24.000 
## 
## $Queens
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00    7.00    7.00    8.25   17.00 
## 
## $`Staten Island`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       3       3       4       4

Graphics

We’ll look at 3 charts to display the most dangerous precincts and boros for traffic collisions in the NYC. The first chart displays a descending bar chart of precincts with the most collision injuries or fatalities. The second chart is is a scatterplot showing the relationship between the injuries/fatalities and bicycle-involved events grouped by Boro. The third is a box plot displaying the Injury/Fatalities stats for each boro.

Bar Plot

The 75th Precinct in Brooklyn had the most collision injuries or fatalities. However, the next 3 top precincts (105, 109, and 114) are all located in Queens.

#Precinct with most traffic collisions
ggplot(data = Collisions_Jun2022, aes(x=reorder( GeoCodeLabel, +Injury_or_Fatal_Collisions), y = Injury_or_Fatal_Collisions)) + 
  geom_bar(position="dodge", stat="identity") + coord_flip(ylim=c(0,110)) + 
  xlab("Precincts") + 
  ylab("Total Injuries of Fatalities") + 
  ggtitle("Traffic Collisions by Precinct - June 2022") 

Scatter Plot

The scatter plot indicates that, while Brooklyn and Manhattan have more instances of the Bicycle collisions in relation to Injuries or Fatalities, you are more likely to be involved in a bike collision with an injury or fatality in Queens.

#Relationship between Bicycle Collisions and Overall Injuries/Fatalities in each Boro
ggplot(Collisions_Jun2022, aes(x=Bicycle, y=Injury_or_Fatal_Collisions, color = Boro, shape = Boro)) + geom_point() + geom_smooth(method=lm) +
  xlab("Bicyle Collisions") + 
  ylab("Total Injuries of Fatalities") + 
  ggtitle("Bicycle Collisions in Overall Injuries/Fatalies by Boro - June 2022")
## `geom_smooth()` using formula 'y ~ x'

Box Plot

The box plot indicates that the frequency and average number of collision injuries or fatalities is highest in Queens and lowest in Manhattan.

ggplot(Collisions_Jun2022, aes(x=Boro, y=Injury_or_Fatal_Collisions, color=Boro)) + 
  geom_boxplot() +
  xlab("Boro") + 
  ylab("Total Injuries of Fatalities") + 
  ggtitle("Collision Injuries or Fatalities by Boro - June 2022")+
  stat_summary(
    aes(label=sprintf("%1.1f", ..y..), color=Boro),
    geom="text", 
    fun = function(y) boxplot.stats(y)$stats,
    position=position_nudge(x=0.33), 
    size=3.5) 

Conclusion

Although the initial summary data initially pointed to Brooklyn being the most dangerous boro for traffic collisions in general, and for bicycle-involved collisions specifically for the month of June, the visualizations highlight Queens as the boro experiencing collision injuries or fatalities. The bar chart shows more Queens precincts make the top 10 locations for injuries or fatalities. The scatter plot indicates Queens has the strongest relationship between injuries/fatalities and bicycle collisions. Finally, the box plot indicates Queens has the highest frequency, range and average number of collisions injuries/fatalities. Although Staten Island initially appeared to have the fewest collisions, the box plot indicates that Manhattan has lowest frequency, range and average number of collisions injuries/fatalities. The overall takeaway, if driving in car or bicycle in NYC, avoid Queens and stay in Manhattan.