DS736 R Project

Tempe Auto Collision Data

This report analyzes data on car collisions collected by the city of Tempe, Arizona from the years 2012-2024. Year 2024 data was only partial and was therefore omitted from most of the analysis in this report.

Dataset

The dataset contains 51,305 observations, representing 51,305 collision reports, and 35 variables, including location (longitude and latitude), date, fatalities and injuries, information about the driver(s), information about the weather, driver(s)’ drug and alcohol status, and actions and violations committed by the driver(s) in the collision.

The 51,305 collisions resulted in 161 fatalities and 23,474 non-fatal injuries. 154 collisions (0.30%) resulted in one or more fatalities, while 16,353 collisions (31.87%) resulted in one or more non-fatal injuries. 2,915 collisions (5.68%) involved a driver who was impaired by drugs or alcohol.

The weather in sunny Tempe, AZ was typically “Clear” (89.96% of collision reports), but occasionally characterized by rain, severe wind, or fog (2.34% of collision reports). Hazardous weather was, however, slightly more common in fatal collisions (3.25% of fatal collisions), with clear weather present in only 72.72% of fatal collisions.

library(data.table)
library(dplyr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(RColorBrewer)
library(leaflet)
library(scales)
library(plotly)

setwd("C://Users//johna//Documents//DS//DS 736")
df <- fread("tempe_crashes.csv")

#Calculate summary figures about dataset 
sum(df$Totalfatalities, na.rm=TRUE)
sum(df$Totalinjuries, na.rm=TRUE)
sum(df$Totalfatalities>0, na.rm=TRUE)
sum(df$Totalinjuries>0, na.rm=TRUE)
sum(df$Totalinjuries>0, na.rm=TRUE)/nrow(df)
sum(df$AlcoholUse_Drv1=="Alcohol"|df$AlcoholUse_Drv2=="Alcohol"|
      df$DrugUse_Drv1=="Drugs"|df$DrugUse_Drv2=="Drugs")
sum(df$AlcoholUse_Drv1=="Alcohol"|df$AlcoholUse_Drv2=="Alcohol"|
      df$DrugUse_Drv1=="Drugs"|df$DrugUse_Drv2=="Drugs")/nrow(df)
sum(df$Weather=="Clear")/nrow(df)
sum(df$Weather=="Rain"|df$Weather=="Blowing Sand Soil Dirt"|
      df$Weather=="Severe Crossinds"|df$Weather=="Fog Smog Smoke"|
      df$Weather=="Sleet Hail Freezing Rain Or Drizzle")/nrow(df)
sum(df$Totalfatalities>0 & (df$Weather=="Rain"|
      df$Weather=="Blowing Sand Soil Dirt"|
      df$Weather=="Severe Crossinds"|df$Weather=="Fog Smog Smoke"|
      df$Weather=="Sleet Hail Freezing Rain Or   Drizzle"))/nrow(df[df$Totalfatalities>0])
sum(df$Totalfatalities>0 & df$Weather=="Clear")/nrow(df[df$Totalfatalities>0])

Data Acquisition and Cleaning

The dataset was downloaded from the Tempe, AZ municipal government website (accessed via data.gov). Before analyzing the data, I performed data cleaning, such as dropping rows with NA values (190 rows out of 51,305 total), and dropping redundant columns.

#download and read in file
fileURL <- "https://data.tempe.gov/api/download/v1/items/0c333bd164d64d62aa0ee6f99b1ccf82/csv?layers=0"

#download.file(fileURL, "tempe_crashes.csv", method="curl")
#df <- fread("tempe_crashes.csv")

###Data Cleaning: Dropping Columns and Rows###

#drop Age_Drv2 (4,758 NA values)
df <- subset(df, select=-Age_Drv2)

#drop remaining rows with NA values (190 rows out of 51,305 total rows)
df <- na.omit(df)

#Drop X and Y columns (equivalent to Longitude and Latitude variables)
df <- subset(df, select=-c(X,Y))

Findings

Click the five tabs to see five visualizations summarizing my findings from this dataset. A paragraph or two above each chart explains its components, and in most cases a paragraph or two after the chart analyzes the significance of the data.

Fatalities Map

This interactive map shows the location of all fatal collisions over the past two years of complete data (2022-2023).

Each icon on the map represents one fatal collision. Hovering over the icon gives the name of the street where the collision occurred, while clicking on the icon gives the year, number of fatalities, and number of non-fatal injuries that resulted from the collision.

###Chart One: Map###

#Dataframe of Fatalities from 2022 and 2023
fatal <- df %>%
  select(Longitude, Latitude, Year, Totalfatalities, Totalinjuries, StreetName) %>%
  filter(Totalfatalities>0) %>%
  filter(Year==2022|Year==2023) %>%
  data.frame()

#customized icon
icon.glyphicon <- makeAwesomeIcon(icon="warning-sign", markerColor = "red", iconColor = "black")

#Map    
m <- leaflet() %>%
  addProviderTiles(providers$USGS) %>%
  setView(lng=mean(fatal$Longitude), lat=mean(fatal$Latitude), zoom=12) %>%
  addAwesomeMarkers(lng=fatal$Longitude, lat=fatal$Latitude,
             icon=icon.glyphicon,
             popup=paste("Year:",fatal$Year,
                         "<br>Fatalaties:", fatal$Totalfatalities,
                         "<br>Non-Fatal Injuries:", fatal$Totalinjuries),
             label=fatal$StreetName)

m

Collision Types by Region

As indicated in the map of fatalities, the city of Tempe is roughly a rectangle, longer north-south. The city may be divided into three geographic regions by latitude: “northern”, “central”, and “southern”, each representing one-third of the range of latitude in the dataset.

The nested donut charts show the percent and number of accidents by collision type in each of these three regions. The outer ring represents the “northern” region, the middle ring represents the “central” region, and the inner ring represents the “southern” region. Hovering over a section of the nested donut charts displays the region, collision type, total number of collisions of the specified type, and percentage of collisions of the specified type within the relevant region. This chart covers all collisions, not only collisions that led to death or injury.

###Chart Two: Nested Donut Charts###

#Defining Three Regions by Latitude
south_cent <- (max(df$Latitude) - min(df$Latitude))/3 + min(df$Latitude)
cent_north <- (max(df$Latitude) - min(df$Latitude))/3*2 + min(df$Latitude)


#Dataframe of accident types and region
acc_types <- df %>%
  select(Unittype_One, Unittype_Two, Latitude)%>%
  filter(Unittype_Two!="")%>%
  mutate(joint_type = paste(Unittype_One, Unittype_Two,sep="-"))%>%                                            
  mutate(region = ifelse(Latitude < south_cent, "Southern",
                          ifelse(Latitude < cent_north, "Central",
                                 "Northern")))%>%
  data.frame()

#Standardizing values
acc_types$joint_type <- replace(acc_types$joint_type,
                                acc_types$joint_type=="Pedalcyclist-Driver",
                                "Driver-Pedalcyclist")


acc_types$joint_type <- replace(acc_types$joint_type,
                                acc_types$joint_type=="Pedestrian-Driver",
                                "Driver-Pedestrian")

#Filtering out values without driver type
acc_types <- acc_types %>%
  filter(joint_type!="Pedalcyclist-Pedalcyclist"&joint_type!="Pedestrian-Pedestrian")%>%
  data.frame()

#Aggregating accident type by region
acc_types <- acc_types %>%
  group_by(region, joint_type)%>%
  summarize(n=length(joint_type), .groups="keep")%>%
  data.frame()

#Nested Donut Charts  

fig <- plot_ly(hole=.7) %>%
  layout(title="Accidents by Collision Type across Three Regions")%>%
  add_trace(data=acc_types[acc_types$region=="Northern",],
            labels=~joint_type,
            values=~n,
            type="pie",
            textpostion="inside",
            hovertemplate="Region: Northern<br>Type: %{label}<br>Percent: %{percent}<br>Accidents: %{value}<extra></extra>")%>%
  add_trace(data=acc_types[acc_types$region=="Central",],
            labels=~joint_type,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Region: Central<br>Type: %{label}<br>Percent: %{percent}<br>Accidents: %{value}<extra></extra>",
            domain=list(
              x=c(.16,.84),
              y=c(.16,.84)
            ))%>%
  add_trace(data=acc_types[acc_types$region=="Southern",],
            labels=~joint_type,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Region: Southern<br>Type: %{label}<br>Percent: %{percent}<br>Accidents: %{value}<extra></extra>",
            domain=list(
              x=c(.27,.73),
              y=c(.27,.73)
            ))

fig

The charts show an increase in collisions between drivers and pedestrians, bicyclists, and driverless cars for the northern region. This appears to be a denser region, and it is also where the university is located. In contrast, there are relatively more driver-driver collisions in the southern region, which appears to be more suburban.

Fatal Collisions: Violation Type & Weather

This horizontal stacked bar chart shows the number of fatal collisions by the recorded violation of the primary driver (or pedestrian) and by the weather conditions at the time of the collision. The chart only includes the top 5 (with a tie for fifth) violations that caused fatalities (ordered by total number of collisions involving a fatality).

###Chart Three: Stacked Bar Chart###

#DataFrame of all accidents with fatalities
fatal <- df[df$Totalfatalities>0] 

#Removing unknown and other violations
fatal <- fatal[fatal$Violation1_Drv1!="Unknown"&fatal$Violation1_Drv1!="Other"]

#Replacing typographical error in Violations Column
fatal$Violation1_Drv1 <- replace(fatal$Violation1_Drv1,
                                 fatal$Violation1_Drv1=="Speed To Fast For Conditions",
                                 "Speed Too Fast For Conditions")

#Top 5 (with tie for 5th) primary driver fatal violations DataFrame
fatal_violations_top6 <- fatal%>%
  count(Violation1_Drv1)%>%
  arrange(desc(n))%>%
  filter(rank(desc(n))<=6)%>%
  data.frame()

#Top 5 (with tie for 5th) primary drivrer fatal violations with weather DataFrame
fatal_violations_top6_weather <- fatal%>%
  select(Weather, Violation1_Drv1)%>%
  filter(Violation1_Drv1 %in% fatal_violations_top6$Violation1_Drv1)%>%
  group_by(Weather)%>%
  count(Violation1_Drv1)%>%
  arrange(desc(n))%>%
  data.frame()

#Horizontal Stacked Bar Chart of Top 5 fatal violations with weather
ggplot(fatal_violations_top6_weather, aes(x=reorder(Violation1_Drv1,n,sum), y=n, fill=Weather)) +
  geom_bar(stat="identity", position=position_stack()) +
  coord_flip()+
  labs(title="Fatal Collisions by Violation Type and Weather", x="Primary Violation",
       y="Number of Fatal Accidents", fill="Weather")+
  theme_few() +
  theme(plot.title = element_text(hjust=.5)) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(breaks=seq(0,max(fatal_violations_top6$n+10),by=10),
                     limits=c(0,max(fatal_violations_top6$n+10))) +
  geom_text(data=fatal_violations_top6, aes(x=Violation1_Drv1, y=n, label=n, fill=NULL), hjust=-.5)

As the chart suggests, while the weather was typically clear, it was clear less frequently during fatal collisions than during non-fatal collisions. As noted above, the weather in sunny Tempe, AZ was typically “Clear” (89.96% of collision reports), but occasionally characterized by rain, severe wind, or fog (2.34% of collision reports). Hazardous weather was, however, slightly more common in fatal collisions (3.25% of fatal collisions), with clear weather present in only 72.72% of fatal collisions.

The chart also indicates that a) excessive speed and b) failure to obey traffic signals and right-of-way rules accounted for most fatal accidents. However, 6 out of 154 fatal accidents were determined to be primarily the fault of a pedestrian failing to use a crosswalk.

Injurious Collisions Pre- vs. Post-Covid

These two heat maps show the total number of injuries resulting from collisions by driver action and by day of the week. The first heat map represents three Pre-Covid years (2016-2019), while the second heat map represents Post-Covid years (2020-2023).

“Driver action” refers to the car’s action (such as turning, slowing, changing lanes, etc.), regardless of whether or not that action was culpable, whereas the “Driver violation” examined in the stacked bar chart refers to at-fault driver actions. The heat maps only include the top 5 driver actions that caused injuries (ordered by total number of injuries caused).

###Chart Four: Heat Maps###

#Processing Year and Day Data
x <- ymd_hms(df$DateTime)
df$weekday <- weekdays(x, abbreviate=TRUE)
df$weekday <- as.factor(df$weekday)
levels(df$weekday) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
df$Year <- as.factor(df$Year)

#Processing Driver Action Values
df$action <- ifelse(df$Unitaction_One=="Entering Parking Position"|
                            df$Unitaction_One=="Leaving Parking Position"|
                            df$Unitaction_One=="Improperly Parked"|
                            df$Unitaction_One=="Properly Parked"|
                            df$Unitaction_One=="Standing"|
                            df$Unitaction_One=="Getting On or Off Vehicle",
                          "Parking-Related",
                          df$Unitaction_One)

#Determine top 5 primary driver actions by total number of injuries caused
top5_inj <- df %>%
  select(Totalinjuries, action) %>%
  filter(action!="Unknown"&action!="Other")%>% #Filter out "Unknown" or "Other" action types
  group_by(action) %>%
  summarize(n=sum(Totalinjuries)) %>%
  arrange(desc(n)) %>%
  filter(rank(desc(n))<=5)%>%
  data.frame()


#Dataframe of top 5 primary driver actions by total number of injuries caused by year
inj <- df %>%
  select(Totalinjuries, action, weekday, Year) %>%
  filter(Year==2016|Year==2017|Year==2018|Year==2019|
           Year==2020|Year==2021|Year==2022|Year==2023) %>% #filter out partial year data from 2024
  filter(action %in% top5_inj$action) %>%
  mutate(covid = ifelse(Year==2024|Year==2023|Year==2022|Year==2021|Year==2020,
                        "Post-Covid (2020-2023)",
                        "Pre-Covid (2016-2019)"))%>%
  group_by(action, covid, weekday) %>%
  summarize(n=sum(Totalinjuries), .groups="keep") %>%
  data.frame()

inj$covid <- factor(inj$covid, levels=c("Pre-Covid (2016-2019)", 
                                        "Post-Covid (2020-2023)"))

#Heat Maps Pre- and Post-Covid
ggplot(inj, aes(x=weekday, y=reorder(action, n), fill=n))+
  geom_tile(color="black")+
  geom_text(aes(label=comma(n)))+
  coord_equal(ratio=1)+
  labs(title = "Injuries by Weekday by Driver Action, Pre- vs. Post-Covid",
       x= "Day of the Week",
       y= "Primary Driver Action",
       fill= "Total Number of Injuries")+
  theme_pander()+
  theme(plot.title = element_text(hjust=.5))+
  scale_fill_continuous(low="white", high="red")+
  facet_wrap(~covid, ncol=1, nrow=2)

It is noteworthy that “Making a left turn” caused significantly more injuries than “Making a right turn.” This is not surprising, since making a left turn typically requires crossing two directions of traffic.

Both heat maps show a dip in injuries in the early-middle part of the week. Comparing the two maps also shows a decrease in injuries in the post-Covid period - possibly due to a decrease in commuting, although other factors may be responsible.

Impaired Driver Collisions

This line plot shows the number of accidents caused by “impaired” drivers per day of the week over the most recent five years of complete data (2019-2023). I classified drivers as “impaired” if they were recorded as under the influence of alcohol or drugs at the time of the collision.

###Chart Five: Line Plot###

#Alcohol and Drug Impaired Accidents Dataframe Over Last 5 Years of Complete Data
alcdrugs_df <- df %>%
  select(Year, weekday, AlcoholUse_Drv1, DrugUse_Drv1, AlcoholUse_Drv2, DrugUse_Drv2) %>%
  filter(Year==2019|Year==2020|Year==2021|Year==2022|Year==2023) %>% #Last Five Years (2024 data is very partial)
  mutate(Type = ifelse(AlcoholUse_Drv1=="Alcohol"|AlcoholUse_Drv2=="Alcohol"|
                         DrugUse_Drv1=="Drugs"|DrugUse_Drv2=="Drugs",
                       "Impaired", "Unimpaired")) %>%
  group_by(weekday, Year) %>%
  summarize(Impaired = sum(Type=="Impaired"),
            Unimpaired = sum(Type=="Unimpaired"),
            Total = length(Year), .groups="keep") %>%
  data.frame()

#Line Plot of Alcohol and Drug Impaired Accidents by Day and Year
ggplot(alcdrugs_df, aes(x=weekday, y=Impaired, group=Year)) +
  geom_line(aes(color=Year), size=2) +
  labs(title = "Accidents with Impaired Drivers by Day and Year, 2019–2023",
       x = "Day of the Week",
       y = "Number of Accidents") +
  theme_linedraw() +
  theme(plot.title = element_text(hjust=.5)) +
  geom_point(shape=18, size=3.5, color="darkblue") +
  scale_y_continuous(breaks=seq(0,max(alcdrugs_df$Impaired),by=10)) +
  scale_color_brewer(palette="Set3", guide=guide_legend(reverse=TRUE))

Surprisingly, the chart shows a dip in DUI-related accidents over the weekend but a rise in the middle of the week (peaking Wednesday and Thursday). This may reflect increased alcohol and drug use over the course of the week along with decreased driving over the weekend. The chart also suggests an increase in DUI-related accidents in the three most recent years of data (2021-2023) compared to 2019-2020.

Recommendations

Traffic control measures in Tempe should focus on reducing speed and enforcing right-of-way rules. Where possible, streets should be designed so as to reduce left turns and protect drivers making necessary lefts.

Efforts to target impaired drivers should focus on the middle to end of the work week, even though injury-causing accidents, in general, dip during this period.

Finally, traffic control measures should concentrate on the northern and central parts of the city, which saw the most collisions and most fatalities in recent years.