The dataset that I chose to analyze for this project is one of all reported car crashes in New York City area since 2012. I believe the way this data is collected is from police reports. Every time a police officers goes to the scene of a crash he or she fills out a report for it and if the incident meets the qualifications for this dataset (anyone injured or at least $1,000 in damages) then it is added to this dataset. From what I have seen the dataset is updated daily. The dataset is a wealth of information and has a lot of variable, almost all of which are self explanatory so I’ll run through them quickly: crash date, crash time, which borough of New York City it’s in, the zip code, the longitude and latitude coordinates, the street name it happened on and the cross-street or the address when applicable of where it happened, the number of people injured and killed and what mode of transportation they were using (I. E. pedestrian, cyclist, motorist, etc.) as well as the total number of injured and killed, the cause of the crash for each vehicle, and the type of each vehicle. I chose to analyze this dataset as I love finding out more about things in my everyday life. While I try to avoid it, I often have to drive through New York City and I thought maybe understanding the crashes might also help understand why driving there is so painful.
My preliminary questions were mostly to do with causes but as I delved into the dataset I found more things to explore. My first thoughts were which types of crashes injure the largest number of people and which cause can be attributed to the most crashes. From there I started to delve into trends of when crashes mostly happen, by year, month, day of the week, and time.
library(tidyverse)
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(anytime)
## Warning: package 'anytime' was built under R version 4.3.2
setwd("C:\\Users\\Shea\\Documents\\data110\\csvs")
crashes <- read.csv("Motor_Vehicle_Collisions_-_Crashes.csv")
names(crashes) <- tolower(names(crashes))
Separating the dates into day, month, and year
crashes <-
mutate(mutate(mutate(crashes, day=as.integer(substr(crash.date,4,5))), month=as.integer(substr(crash.date,1,2))), year=as.integer(substr(crash.date,7,10)))
Adding what day of the week the crash happened
crashes <- mutate(crashes, weekday = strftime(anydate(crash.date), "%A"))
Turning the time into an int value to allow the use of a histogram
crashhour <- mutate(crashes, timeint = as.integer(gsub(":", "", crash.time)))
write.csv(crashes, "C:\\Users\\Shea\\Documents\\data110\\csvs\\crashes2.csv", row.names=FALSE)
box <- ggplot(crashes, aes(x = contributing.factor.vehicle.1, y = number.of.persons.injured+number.of.persons.killed, fill = contributing.factor.vehicle.1)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "",
x = "Contributing Factor",
y = "Persons affected",
fill = "",
caption = "Source: The State of New York") +
scale_y_continuous(trans='log2') +
coord_flip() +
theme_bw() +
theme(axis.text = element_text(size=6))
box
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1582302 rows containing non-finite values (`stat_boxplot()`).
I found this plot to be too homogeneous to provide meaningful information. While we can see some types of crashes effect more people on average, there are too many types that look exactly the same. To combat this I tried below to filter out only crashes that injure and kill at least one person.
box2data <- filter(crashes, number.of.persons.injured+number.of.persons.killed > 0)
box2 <- ggplot(box2data, aes(x = contributing.factor.vehicle.1, y = number.of.persons.injured+number.of.persons.killed, fill = contributing.factor.vehicle.1)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Boxplot of the Number of People Harmed by the Cause of the Crash",
x = "Contributing Factor",
y = "Persons affected",
caption = "Source: The State of New York") +
scale_y_continuous(trans='log2') +
coord_flip() +
theme_bw() +
theme(axis.text = element_text(size=6))
box2
Unfortunately this plot ended up looking the exact same so I decided to abandon this front.
numcrashescausev1 <- summarise(group_by(crashes, contributing.factor.vehicle.1), num=length(contributing.factor.vehicle.1))
numbar1 <- numcrashescausev1[4:62,]
numbar <- numbar1[order(numbar1$num, decreasing = TRUE),]
bar <- numbar[2:59,] |>
ggplot() +
geom_bar(aes(x=factor(contributing.factor.vehicle.1, levels = contributing.factor.vehicle.1), y=num, fill = "#0307fc"),
position = "dodge", stat = "identity", show.legend = FALSE) +
labs(x = "Number of Crashes",
y = "Cause (Vehicle 1)",
title = "Number of Crashes Caused by each Cause",
caption = "Source: The State of New York") +
scale_y_continuous(trans='log10') +
coord_flip() +
theme_bw() +
theme(axis.text = element_text(size=6))
bar
This plot bore much more fruit in my opinion. All of the usual suspects make it up to the top, generally being a variety of the driver either not paying attention of intentionally driving unsafely. I was particularly surprised by how high up things like alcohol and prescription drugs ranked and how low talking on the phone and texting did. Car defects trend towards the lower end understandably as cars are made to be reliable.
histtime <- crashhour |>
ggplot(aes(x=timeint, fill = "#0307fc"),show.legend = FALSE) +
geom_histogram() +
labs(x = "Time",
y = "Number of Crashes",
title = "Number of crashes by Time of Day",
caption = "Source: The State of New York") +
theme_bw()
histtime
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This one was surprising to me as I expected high peaks during both rush hours. This shows a local peak by both but midday has as many crashes as morning rush hour does. Afternoon rush hour leads the number of crashes by a significant amount.
histmonth <- crashes |>
ggplot(aes(x=month, fill = "#0307fc")) +
geom_bar()+
labs(x = "Month",
y = "Number of Crashes",
title = "Number of Crashes by Month",
caption = "Source: The State of New York") +
theme_bw()
histmonth
This graph shows an interestingly larger number of crashes in the summer. I’m not sure why that would be, but I think the deviation in months within those groups is because of the number of days in the month.
histyear <- crashes |>
ggplot(aes(x=year, fill = "#0307fc"), show.legend = FALSE) +
geom_bar()+
labs(x = "Year",
y = "Number of Crashes",
title = "Number of Crashes by Year",
caption = "Source: The State of New York") +
theme_bw()
histyear
This graph shows a steep drop off that happened in 2020 as well as much fewer in 2012. 2012 I think can be attributed to the dataset started that year and may have started in the middle. 2020 and onward on the other hand, I believe indicates a disease we all know and love: COVID-19. People were doing their best to quarantine so a lot fewer people were driving around. I’m surprised to see that continue to trend down in 2022 and 2023 though, as people are no longer as worried about the virus. Maybe that can be attribute to already high real estate prices turning exhorbinate and more people working from home.
histweekday <- crashes |>
ggplot(aes(x=factor(weekday, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")), fill = "#0307fc")) +
geom_bar(show.legend = FALSE)+
labs(x = "Day",
y = "Number of Crashes",
title = "Number of Crashes by Day of the Week",
caption = "Source: The State of New York") +
theme_bw()
histweekday
I thought the differences between the days of the week would be more pronounced but this display proves me wrong. The significant uptick on Friday might be because of people trying to rush home after a long day at work or may be because of going out to parties Friday night.
The above visualization is a graph of the number of people killed and injured by time of day. There were a few things in it that I found interesting. For one, the number of motorists killed around rush hour time is the lowest all day but the number of injured is a similar peak to the number of crashes graph. I think this might have to do with driving speeds. The number of cars on the road adds to the number of injuries but can take away from the number of deaths as traffic is moving a lot slower. This would also explain why the most deaths happen in the middle of the night. High speeds mixed with tired and potentially inebriated drivers is a recipe for disaster.
facet <- crashhour |>
ggplot(aes(x=timeint, fill = factor(weekday, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))) +
geom_histogram() +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(~weekday) +
theme_bw() +
labs(x = "Time",
y = "Number of Crashes",
title = "Number of Crashes by Time per Day of the Week",
fill = "Day",
caption = "Source: The State of New York")
facet
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Given the significance of different times of day with crash rates, I was very curious to see how that would split up into weekday vs weekend. As I had alluded to as a guess earlier, I wasn’t surprised to see higher amounts of car accidents at night on the weekend and a lack of morning rush hour. What did surprise me though was that even Saturday and Sunday suffer from afternoon rush hour. This also helps answer my prior question about why morning rush hour only matched early afternoon: over the weekend there is a lot more driving early afternoon and very little in the morning which evens it out.
The one major thing I would have loved to do was to add a GIS plot. I had the code written and running but I realized that probably wouldn’t be possible when I ran out of memory. Also I couldn’t figure out how to factor the facet to arrange them by day but I think the number of plots is small enough that it’s easy to find the data you want.