Introduction

Road traffic accidents are a major global health crisis that devastates individuals, families, and communities. Approximately 1.19 million people die each year in road traffic crashes, and a further 20 to 50 million suffer non-fatal injuries, many of which result in disability (World Health Organization [WHO], 2021). At 1.19 million deaths a year, that is about one death every 26 seconds. Road traffic accidents are currently the eighth leading cause of death globally, with projections suggesting a rise to the seventh position by 2030 if current trends persist (WHO, 2021). The impact extends beyond immediate fatalities: survivors face long-term consequences, including extended hospital stays and permanent disabilities that affect their ability to work and their quality of life (Jones et al., 2019). The emotional and financial toll on victims’ families creates ripple effects throughout communities (Smith & Johnson, 2020).

Objective

This research aims to identify the variables that contribute most to road accidents and their severity, treating accident severity prediction as a classification problem.

Business Questions

The study seeks to answer the following business questions:

- What are the key factors that contribute to severe traffic accidents?
- What patterns exist in traffic accidents across different times of the day, days of the week, or months?
- When do traffic crashes occur most frequently, and how do these patterns vary by time of day, day of the week, or month?
- How do weather and lighting conditions affect the frequency of traffic crashes?
- How effective are traffic control devices in preventing or reducing traffic crashes?
- Where are the high-risk crash locations, and what geographic factors contribute to accidents?
- How does the type of roadway influence the occurrence and severity of crashes?
- What trends in traffic crashes have emerged over the years?

The study uses machine learning (ML) models to predict road accident severity, analyzes the results, and applies explainable ML techniques to evaluate the importance of contributing factors. The research uses three ML algorithms: Random Forest (RF), XGBoost, and CatBoost. These models were selected for their proven effectiveness in handling complex, high-dimensional datasets and their ability to capture non-linear relationships between variables (Ahmed et al., 2023). Ensemble methods like RF and boosting algorithms have shown superior performance in similar classification tasks, offering robust predictions and insights into feature importance (Dong et al., 2022).

Methodology

In this analysis, we used various tools and techniques to explore and derive insights from the traffic crash dataset sourced from the City of Chicago’s public data portal. The primary aim was to identify key patterns, trends, and factors contributing to traffic accidents, which could help inform road safety measures and improve traffic management strategies.

Data Collection

The dataset used for this analysis is sourced from the City of Chicago’s public data portal, specifically traffic crash and vehicle data. It contains detailed records of traffic crashes reported in Chicago, offering insights into the circumstances surrounding each crash. The dataset includes various variables such as crash date, location, weather and lighting conditions, crash severity, vehicle year, make, and model, as well as contributing causes, injuries, and fatalities.

Data Preprocessing

Two datasets were collected: traffic crash data and vehicle data. After merging them, we initially had 1,830,208 rows and 25 selected columns. The merged dataset contained many duplicated and repeated factors per accident. For instance, if one accident involved five persons (two drivers and three passengers), the dataset repeated the contributing factors for every person, which is unnecessary for our work. Hence, when selecting the factors contributing to an accident, we kept only rows where UNIT_TYPE was DRIVER and aggregated them to one row per accident. We also extracted CRASH_YEAR from CRASH_DATE.

The original dataset included five classes for MOST_SEVERE_INJURY: FATAL; NO INDICATION OF INJURY; REPORTED, NOT EVIDENT; NONINCAPACITATING INJURY; and INCAPACITATING INJURY. We removed rows categorized as “REPORTED, NOT EVIDENT” (39,098 rows) and retained the remaining four injury classes for further analysis. Additionally, the MAKE and MODEL columns were combined to form the VEHICLE_TYPE column, resulting in 854,659 rows and 24 columns at this stage.

Missing values and Outliers

The majority of the variables, such as crash severity, weather conditions, and contributory causes, have no missing values. However, a small percentage (0.72%) of the LATITUDE and LONGITUDE values were missing, as were 14.7% of the VEHICLE_YEARS values. We removed rows with missing latitude or longitude and filled missing vehicle years with 2014. After cleaning, the dataset was finalized with 848,481 rows and 25 columns. Outliers were treated using a log transformation. A detailed list of the data variables can be found in the appendix.

# Calculate missing value percentages
> total_rows <- combined_data_select %>% count() %>% pull()
> missing_values <- combined_data_select %>%
+   summarize(across(everything(), ~ sum(isnull(.) %>% as.integer()))) %>%
+   collect()
> missing_percentage <- missing_values %>%
+   mutate(across(everything(), ~ . / total_rows * 100))
> print(as.data.frame(missing_percentage))
  CRASH_RECORD_ID POSTED_SPEED_LIMIT WEATHER_CONDITION LIGHTING_CONDITION FIRST_CRASH_TYPE
1               0                  0                 0                  0                0
  TRAFFICWAY_TYPE ALIGNMENT DAMAGE MOST_SEVERE_INJURY PRIM_CONTRIBUTORY_CAUSE ROAD_DEFECT
1               0         0      0                  0                       0           0
  TRAFFIC_CONTROL_DEVICE DEVICE_CONDITION INJURIES_TOTAL CRASH_HOUR CRASH_DAY_OF_WEEK CRASH_MONTH
1                      0                0              0          0                 0           0
   LATITUDE LONGITUDE VEHICLES_INVOLVED VEHICLE_YEARS MANEUVERS CRASH_YEAR VEHICLE_TYPE
1 0.7228614 0.7228614                 0      15.08871         0          0            0
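The log transformation mentioned above does not appear in the code shown in this report. As a minimal sketch, assuming it was applied to a right-skewed count column such as INJURIES_TOTAL (the choice of column and the toy values below are assumptions, not taken from the analysis), `log1p()` compresses large values while keeping zeros at zero:

```r
# Hypothetical injury counts with one extreme value (illustrative, not real data)
injuries_total <- c(0, 0, 1, 2, 3, 21)

# log1p(x) computes log(1 + x), so zero counts stay at zero instead of -Inf
injuries_log <- log1p(injuries_total)

# Compare the spread before and after the transformation
raw_range <- diff(range(injuries_total))
log_range <- diff(range(injuries_log))

# expm1() reverses the transformation when the original units are needed
injuries_back <- expm1(injuries_log)

print(round(injuries_log, 3))
```

Using `log1p()` rather than `log()` is a design choice for count data, where zeros are common and `log(0)` is undefined.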

Analysis

# Load required packages
library(pacman)
p_load(tidyverse)
library(sparklyr)

library(dplyr)
library(ggplot2)
library(corrr)
library(dbplot)
library(lubridate)
library(scales)

library(leaflet)
library(sf)

library(maps)

library(httr)
library(jsonlite)

#install.packages("stringr")
library(stringr)

# Connect to local Spark cluster
sc <- spark_connect(master = "local", version = "2.4")

# Load crash and vehicle datasets
crash_data <- spark_read_csv(sc, path = "data/Traffic_Crashes.csv", 
                             name ='traffic_crash', memory = FALSE)
vehicle_data <- spark_read_csv(sc, path = "data/Traffic_Crashes_Vehicles.csv", 
                               name ='vehicle', memory = FALSE)

# Combine datasets
combined_data <- crash_data %>%
  inner_join(vehicle_data, by = "CRASH_RECORD_ID")

###################### 3. DATA PREPROCESSING ###############################
# Keep driver rows only, clean MAKE and MODEL, select the relevant columns,
# then aggregate to one row per crash
combined_data_select <- combined_data %>% 
  filter(UNIT_TYPE == "DRIVER") %>%
  mutate(
    # Strip double quotes, single quotes, and parentheses, then trim spaces
    MAKE = sql("TRIM(BOTH ' ' FROM REPLACE(REPLACE(REPLACE(REPLACE(MAKE, '\"', ''), '''', ''), '(', ''), ')', ''))"),
    # Same cleaning for MODEL (the source column must be MODEL, not MAKE)
    MODEL = sql("TRIM(BOTH ' ' FROM REPLACE(REPLACE(REPLACE(REPLACE(MODEL, '\"', ''), '''', ''), '(', ''), ')', ''))")
  ) %>%
  select(
    CRASH_RECORD_ID, CRASH_DATE_x, POSTED_SPEED_LIMIT, TRAFFIC_CONTROL_DEVICE, 
    DEVICE_CONDITION, WEATHER_CONDITION, LIGHTING_CONDITION, FIRST_CRASH_TYPE, 
    TRAFFICWAY_TYPE, ALIGNMENT, DAMAGE, MOST_SEVERE_INJURY,
    PRIM_CONTRIBUTORY_CAUSE, ROAD_DEFECT, 
    INJURIES_TOTAL, CRASH_HOUR, CRASH_DAY_OF_WEEK, CRASH_MONTH, LATITUDE, LONGITUDE, 
    MAKE, MODEL, VEHICLE_YEAR, MANEUVER
  ) %>%
  group_by(CRASH_RECORD_ID) %>%
  summarise(
    CRASH_DATE = min(CRASH_DATE_x),
    POSTED_SPEED_LIMIT = min(POSTED_SPEED_LIMIT),
    WEATHER_CONDITION = min(WEATHER_CONDITION),
    LIGHTING_CONDITION = min(LIGHTING_CONDITION),
    FIRST_CRASH_TYPE = min(FIRST_CRASH_TYPE),
    TRAFFICWAY_TYPE = min(TRAFFICWAY_TYPE),
    ALIGNMENT = min(ALIGNMENT),
    DAMAGE = min(DAMAGE),
    MOST_SEVERE_INJURY = min(MOST_SEVERE_INJURY),
    PRIM_CONTRIBUTORY_CAUSE = min(PRIM_CONTRIBUTORY_CAUSE),
    ROAD_DEFECT = min(ROAD_DEFECT),
    TRAFFIC_CONTROL_DEVICE = min(TRAFFIC_CONTROL_DEVICE),
    DEVICE_CONDITION = min(DEVICE_CONDITION),
    INJURIES_TOTAL = max(INJURIES_TOTAL),
    CRASH_HOUR = min(CRASH_HOUR),
    CRASH_DAY_OF_WEEK = min(CRASH_DAY_OF_WEEK),
    CRASH_MONTH = min(CRASH_MONTH),
    LATITUDE = min(LATITUDE),
    LONGITUDE = min(LONGITUDE),
    VEHICLES_INVOLVED = n(),
    VEHICLE_YEARS = mean(VEHICLE_YEAR, na.rm = TRUE),
    VEHICLE_TYPE = concat_ws(", ", collect_list(paste(MAKE, MODEL))), # combine the MAKE and Model
    MANEUVERS = concat_ws(", ", collect_list(MANEUVER))
  )

# -------- Data cleaning --------
# After this completes, then we can add the year
combined_data_select <- combined_data_select %>%
  mutate(CRASH_YEAR = year(to_date(CRASH_DATE, "MM/dd/yyyy")))

## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.

# CRASH_DATE could be dropped here, but it is kept because the map popups use it
#combined_data_select <- combined_data_select %>% select(-CRASH_DATE)
# Remove rows with MOST_SEVERE_INJURY == "REPORTED, NOT EVIDENT"
combined_data_select <- combined_data_select %>% 
  filter(MOST_SEVERE_INJURY != "REPORTED, NOT EVIDENT")
# Remove rows with missing coordinates
clean_data <- combined_data_select %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE))
clean_data_final <- clean_data %>%
  mutate(VEHICLE_YEARS = coalesce(VEHICLE_YEARS, 2014))

Descriptive Statistics

The data show that most incidents, about 89.9%, had no indication of injury, making this the largest category. Non-incapacitating injuries accounted for 8.22% of the cases, while incapacitating injuries made up 1.75%. Fatal cases were the rarest, at only 0.114% of all incidents. By year, the fatal share stayed very small throughout: from 2015 to 2024 it ranged between 0.03% and 0.15%, with the highest rates in 2020 and 2021 (0.15%) and the lowest in 2015 and 2016 (0.03%). The share rose from 0.03% in 2015 and 2016 to 0.10% in 2017 and has stayed between 0.09% and 0.15% since, so fatalities remain rare and have been relatively stable since 2017.

 MOST_SEVERE_INJURY        Count Percentage
  <chr>                     <dbl>      <dbl>
1 NO INDICATION OF INJURY  762898     89.9  
2 NONINCAPACITATING INJURY  69773      8.22 
3 FATAL                       970      0.114
4 INCAPACITATING INJURY     14840      1.75
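The percentages above follow directly from the class counts. The short sketch below recomputes them from the counts in the table and adds one extra figure, the ratio of no-injury to fatal crashes, as a rough illustration of how imbalanced the target variable is. The variable names are illustrative and not part of the original pipeline.

```r
# Class counts copied from the table above
severity_counts <- c(
  "NO INDICATION OF INJURY"  = 762898,
  "NONINCAPACITATING INJURY" = 69773,
  "FATAL"                    = 970,
  "INCAPACITATING INJURY"    = 14840
)

# Total rows after cleaning and the share of each class in percent
total_crashes <- sum(severity_counts)
severity_pct  <- 100 * severity_counts / total_crashes

# Imbalance ratio: number of no-injury crashes per fatal crash
imbalance_ratio <- severity_counts[["NO INDICATION OF INJURY"]] /
  severity_counts[["FATAL"]]

print(signif(severity_pct, 3))
print(round(imbalance_ratio))
```

The total matches the 848,481 rows reported after cleaning, and the ratio gives a sense of why accuracy alone is a weak measure for the FATAL class.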

Descriptive statistics of most severe injury by year.
Year	Count	Fatal (%)	Incapacitating Injury (%)	No Indication of Injury (%)	Nonincapacitating Injury (%)
2014	6	0.00	0.00	83.33	16.67
2015	9485	0.03	0.54	97.35	2.08
2016	42819	0.03	0.64	96.56	2.76
2017	79953	0.10	1.55	92.13	6.22
2018	113105	0.10	1.87	89.89	8.14
2019	111456	0.09	1.79	89.97	8.15
2020	87149	0.15	2.14	88.49	9.22
2021	103027	0.15	1.94	88.72	9.19
2022	102412	0.13	1.88	89.01	8.98
2023	104147	0.13	1.79	89.03	9.05
2024	94759	0.11	1.59	88.82	9.48

Temporal Analysis

A trend analysis of the data shows the number of recorded crashes rising each year through 2018, when it peaked at 113,105 in the cleaned data, followed by a decrease with some fluctuations thereafter. This trend suggests that while traffic-related incidents were on the rise in the earlier years, recent years have shown a decline or stabilization.

Trend Occurrence of Crashes
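As a small check on the trend described above, the yearly counts from the severity-by-year table can be used to locate the peak year and the year-over-year change. This is a sketch on the published table values only (2014 is left out because it has just 6 records); the variable names are illustrative.

```r
# Yearly crash counts copied from the severity-by-year table
crashes_by_year <- c(
  "2015" = 9485,   "2016" = 42819,  "2017" = 79953,  "2018" = 113105,
  "2019" = 111456, "2020" = 87149,  "2021" = 103027, "2022" = 102412,
  "2023" = 104147, "2024" = 94759
)

# Year with the highest number of recorded crashes
peak_year <- names(which.max(crashes_by_year))

# Percentage change from each year to the next (named by the later year)
yoy_change <- 100 * diff(crashes_by_year) / head(crashes_by_year, -1)

print(peak_year)
print(round(yoy_change, 1))
```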

Further analysis of the data reveals that Friday is the day with the highest occurrence of crashes, likely due to increased traffic on this day. In terms of time, the hours between 12:00 PM and 5:59 PM account for the highest number of crashes, which could coincide with lunch breaks and after-work traffic. Crashes are least frequent in the late-night and early-morning hours (12:00 AM - 5:59 AM), which may reflect lower traffic volumes during these times. The data also show that the number of crashes increases from May to October, which may correlate with better weather conditions and an increase in outdoor activities. Conversely, the decrease in crashes during the winter months may be attributed to harsher weather conditions, which deter driving.

Temporal Distribution
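The time-of-day grouping used above can be reproduced by binning CRASH_HOUR into six-hour windows. The sketch below is a minimal base R illustration on made-up hours. Only the 12:00 AM - 5:59 AM and 12:00 PM - 5:59 PM windows are named in the text, so the morning and evening boundaries and all labels are assumptions.

```r
# Hypothetical crash hours in 24-hour format (illustrative, not real data)
crash_hour <- c(0, 3, 7, 9, 12, 13, 15, 17, 17, 18, 21, 23)

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
time_of_day <- cut(
  crash_hour,
  breaks = c(0, 6, 12, 18, 24),
  right  = FALSE,
  labels = c("Late night (12 AM-5:59 AM)", "Morning (6 AM-11:59 AM)",
             "Afternoon (12 PM-5:59 PM)", "Evening (6 PM-11:59 PM)")
)

# Count crashes per window and find the busiest one
hour_counts    <- table(time_of_day)
busiest_window <- names(which.max(hour_counts))

print(hour_counts)
print(busiest_window)
```

Setting `right = FALSE` matters here: it places hour 12 in the afternoon window and hour 18 in the evening window, which matches the "12:00 PM to 5:59 PM" wording.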

Environmental Conditions & Primary Cause

Most crashes occurred under clear weather conditions and daylight, aligning with the temporal analysis that highlights peak crash times between 12 PM and 5:59 PM. The primary cause of these crashes was “Failing to yield the right of way,” followed by “Following too closely” and “Improper overtaking/passing.” Weather played a minimal role, accounting for only 1.4% of crashes. Notably, “Failing to yield the right of way” was the leading cause across all three injury categories, underscoring its significant impact on crash incidents.

Environmental Condition

Spatial Distribution

The map shows several high-concentration areas of crashes, indicated by the orange-red circles with numbers:
- 3823 crashes in the central Chicago area
- 1792 crashes near US 41
- 1432 crashes in the Oak Park/Elmwood Park area
- 1189 crashes in the southern part of the city
- 381 crashes near Blue Island
- 175 crashes in the Norridge area

Most Dangerous Intersections
- East 79th Street and South Stony Island Avenue (nearly 200 crashes in three years)
- East 95th Street and South Stony Island Avenue (over 150 accidents)
- West Garfield Boulevard and South Wentworth Avenue

Crash Statistics
- Chicago experiences approximately 270 car crashes every day (one crash every 5-6 minutes).
- The city accounts for about one-third of all crashes in Illinois.
- In 2021, Chicago had 98,563 crashes, a 21% increase from the previous year.
- Fatal accidents have been increasing, with 230 fatalities in 2021 compared to 132 in 2018.

Geographic Patterns
- High concentration of crashes along major arterial roads
- Significant clustering in downtown and near-downtown areas
- Higher crash rates at complex intersections, especially those with multiple bus routes or highway ramps
- Notable crash patterns along the lakefront corridor

This visualization helps identify areas that need immediate attention for traffic safety improvements and infrastructure upgrades.

#---------------------------- 6. VISUALIZATIONS  ----------------------------

### A. SPATIAL ANALYSIS (Location-based visualizations) ###
# A.1. Prepare Chicago crash data for mapping
# Filter to Chicago boundaries and sample for performance
chicago_crashes <- clean_data_final %>%
  filter(
    LATITUDE >= 41.6 & LATITUDE <= 42.1,  # Approximate Chicago bounds
    LONGITUDE >= -87.9 & LONGITUDE <= -87.5
  ) %>%
  sdf_sample(fraction = 0.01) %>%  # Take 1% sample to avoid browser overload
  collect()

# A.2. Create interactive map with hover effects and clustering
chicago_map <- chicago_crashes %>%
  leaflet() %>%
  addTiles() %>%  
  setView(lng = -87.6298, lat = 41.8781, zoom = 11) %>%
  addCircleMarkers(
    lng = ~LONGITUDE,
    lat = ~LATITUDE,
    radius = 8,
    fillOpacity = 0.7,
    color = "red",
    weight = 1,
    label = ~paste("Crash Type:", FIRST_CRASH_TYPE),
    popup = ~paste(
      "<b>Crash Details:</b><br>",
      "Date:", CRASH_DATE, "<br>",
      "Type:", FIRST_CRASH_TYPE, "<br>",
      "Weather:", WEATHER_CONDITION, "<br>",
      "Injuries:", INJURIES_TOTAL, "<br>",
      "Vehicles Involved:", VEHICLES_INVOLVED
    ),
    clusterOptions = markerClusterOptions(
      spiderfyOnMaxZoom = TRUE,
      showCoverageOnHover = TRUE,
      zoomToBoundsOnClick = TRUE
    ),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  ) %>%
  addLegend(
    position = "bottomright",
    colors = "red",
    labels = "Crash Location",
    title = "Legend"
  )
chicago_map

Trafficway Type

The analysis reveals that NOT DIVIDED roads have the highest number of fatal crashes, with 361 fatal accidents recorded. This finding suggests that roads without physical separation between opposing traffic flows pose the greatest risk for fatal accidents.

TRAFFICWAY_TYPE                 MOST_SEVERE_INJURY total_crashes
   <chr>                           <chr>                      <dbl>
 1 NOT DIVIDED                     FATAL                        361
 2 DIVIDED - W/MEDIAN (NOT RAISED) FATAL                        188
 3 DIVIDED - W/MEDIAN BARRIER      FATAL                        123
 4 FOUR WAY                        FATAL                        107
 5 ONE-WAY                         FATAL                         90
 6 T-INTERSECTION                  FATAL                         23
 7 OTHER                           FATAL                         23
 8 CENTER TURN LANE                FATAL                         15
 9 ALLEY                           FATAL                         11
10 PARKING LOT                     FATAL                         11
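The table above reports counts, not rates. One way to put the counts in context, using only figures already in this report, is to express each trafficway type as a share of all fatal crashes. The sketch below assumes the 970 FATAL crashes from the descriptive statistics and the counts in this table come from the same cleaned dataset. A true fatality rate would also need the total number of crashes per trafficway type, which is not shown here.

```r
# Fatal crash counts by trafficway type, copied from the table above
fatal_by_type <- c(
  "NOT DIVIDED" = 361, "DIVIDED - W/MEDIAN (NOT RAISED)" = 188,
  "DIVIDED - W/MEDIAN BARRIER" = 123, "FOUR WAY" = 107, "ONE-WAY" = 90,
  "T-INTERSECTION" = 23, "OTHER" = 23, "CENTER TURN LANE" = 15,
  "ALLEY" = 11, "PARKING LOT" = 11
)

# FATAL count from the descriptive statistics table (assumed same dataset)
total_fatal <- 970

# Share of all fatal crashes attributed to each trafficway type, in percent
share_of_fatal <- 100 * fatal_by_type / total_fatal

# Share of all fatal crashes covered by the ten listed types
covered_pct <- 100 * sum(fatal_by_type) / total_fatal

print(round(share_of_fatal, 1))
print(round(covered_pct, 1))
```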

Prediction

To predict the severity of crashes, three advanced machine learning models were employed: Random Forest, XGBoost, and CatBoost. Each of these models offers unique strengths in handling complex, high-dimensional datasets and capturing non-linear relationships between variables (Ahmed et al., 2023).

Random Forest

Random Forest is an ensemble learning method that constructs multiple decision trees and merges them to produce a more accurate and stable prediction. It is particularly effective in handling large datasets with high dimensionality and can provide insights into feature importance (Breiman, 2001).

XGBoost

XGBoost (Extreme Gradient Boosting) is a scalable and highly efficient implementation of gradient boosting machines. It has gained popularity in machine learning competitions due to its speed and performance. XGBoost is known for its ability to handle sparse data and its built-in regularization to prevent overfitting (Chen & Guestrin, 2016).

CatBoost

CatBoost is a gradient boosting library developed by Yandex, specifically designed to handle categorical features efficiently. It implements symmetric trees and ordered boosting to reduce prediction shift and improve accuracy. CatBoost is particularly useful when dealing with datasets that contain many categorical variables (Prokhorenkova et al., 2018).

These ensemble methods have shown superior performance in similar classification tasks, offering robust predictions and insights into feature importance (Dong et al., 2022). By employing these diverse models, the study aims to capture a comprehensive understanding of the factors contributing to crash severity and provide accurate predictions to inform road safety measures.

Model Evaluation & Results

The dataset was split into 70% training and 30% testing data. Due to the dataset’s extreme imbalance, several models struggled with misclassification, particularly across the minority classes. However, after addressing the imbalance, the following results were observed:

- Random Forest achieved the highest overall accuracy of 69.6%.
- XGBoost and CatBoost followed with accuracies of 57.9% and 57.5%, respectively.
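The report does not show how the 70/30 split was drawn or how the class imbalance was addressed. As one possible approach, sketched here under stated assumptions, the split can be stratified by severity class so that rare classes appear in both sets, and inverse-frequency class weights can be derived from the training rows. The toy labels, the seed, and the weighting formula are illustrative choices, not the method used in the study.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy severity labels with a deliberately skewed class mix (not real data)
severity <- factor(rep(
  c("NO INDICATION OF INJURY", "NONINCAPACITATING INJURY",
    "INCAPACITATING INJURY", "FATAL"),
  times = c(900, 80, 18, 2)
))

# Stratified 70/30 split: sample 70% of the row indices within each class
rows_by_class <- split(seq_along(severity), severity)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(severity), train_idx)

# Inverse-frequency class weights computed on the training rows only
train_counts  <- table(severity[train_idx])
class_weights <- sum(train_counts) / (length(train_counts) * train_counts)

print(train_counts)
print(round(class_weights, 2))
```

Sampling within each class keeps the rarest class from landing entirely in one partition, which a simple random split of a very imbalanced dataset can do by chance.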

Despite Random Forest showing better overall accuracy, the confusion matrix analysis revealed that CatBoost outperformed the other models in predicting the FATAL class:

- Random Forest predicted only 6 instances correctly as FATAL.
- XGBoost improved with 116 correct predictions for FATAL.
- CatBoost was the best performer for this class, with 158 correct predictions for FATAL.

This demonstrates that CatBoost had better sensitivity for the FATAL class, which is crucial in highly imbalanced datasets.
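The gap between overall accuracy and sensitivity for a rare class can be made concrete with a confusion matrix. The sketch below uses entirely hypothetical counts (not the study's results) and shortened class labels to show how accuracy and per-class recall are computed, and how a model can score well on accuracy while missing most fatal crashes.

```r
# Hypothetical confusion matrix: rows are actual classes, columns are predictions
classes <- c("FATAL", "INCAPACITATING", "NONINCAPACITATING", "NO INJURY")
cm <- matrix(
  c(30,  10,   15,    45,
     5, 200,  150,   145,
    10, 300, 1200,  1490,
    20, 500, 3480, 16000),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Overall accuracy: correctly classified crashes over all crashes
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall (sensitivity): correct predictions over actual cases per class
recall <- diag(cm) / rowSums(cm)

print(round(accuracy, 3))
print(round(recall, 2))
```

In this made-up example the majority class dominates the accuracy figure, while recall for FATAL stays low, which is the pattern the comparison above describes.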

Models Accuracy Comparison

Random Forest Confusion Matrix

CatBoost Confusion Matrix

XGBoost Confusion Matrix

Feature Importance

Conclusion

This research provides a comprehensive analysis of traffic crashes in Chicago, identifying key factors contributing to accidents and predicting crash severity using advanced machine learning models. The insights address all the business questions posed, offering actionable recommendations to improve road safety.

The overall model accuracy was not as strong as anticipated, reflecting the challenges inherent in predicting crash severity from the dataset. However, among the models tested, CatBoost demonstrated superior performance, particularly in classifying fatal crash categories.

Feature importance analysis from the CatBoost model highlighted critical factors influencing crash severity. These include: Primary Contributing Cause, Lighting Conditions, Crash Hour and Month. Temporal patterns, such as specific times of the day and months of the year, significantly impacted severity predictions.

While the accuracy metrics indicate room for improvement, the insights derived from CatBoost provide valuable guidance for focusing interventions on the most impactful factors.

Answers to Business Questions

Key Factors Contributing to Severe Traffic Accidents
- “Failing to yield the right of way” is the primary cause of crashes, regardless of severity.
- Roads that are not divided by medians or barriers account for the highest number of fatal crashes (361), pointing to the risk of opposing traffic flows without physical separation.
- Temporal factors, including time of day (12 PM to 5:59 PM) and day of the week (Fridays), correlate with higher crash occurrences.
Patterns Across Time
- Crashes are most frequent between 12 PM and 5:59 PM and during May to October, reflecting higher activity during these periods.
- Friday records the highest crash frequency, likely due to increased end-of-week travel.
Impact of Weather and Lighting Conditions
- Most crashes occur under clear weather and daylight conditions. Weather-related factors contribute minimally (1.4%) to crash occurrences.
- Dark conditions without proper lighting significantly increase crash severity, emphasizing the need for better street lighting.
High-Risk Locations and Geographic Patterns
- Central Chicago and major arterial roads are hotspots for crashes. Specific intersections, such as East 79th Street and South Stony Island Avenue, show a higher concentration of incidents, requiring targeted interventions.
- Complex intersections and areas near highway ramps are particularly hazardous.
Roadway Type and Crash Severity
- Undivided roads record the most fatal crashes, underscoring the need for infrastructure changes, such as installing medians or barriers.
Trends Over the Years
- While crash counts peaked in 2018, there has been a decline in recent years, indicating potential improvements in road safety measures. However, the share of fatal crashes has shown little variation, calling for more effective interventions.

Recommendations

To improve road safety and reduce crash severity, several infrastructure, traffic management, and public awareness strategies are recommended based on the feature importance analysis and identified factors influencing crash severity:

Infrastructure Improvements
- Install medians or barriers on undivided roads to reduce fatal crashes.
- Enhance street lighting, especially in high-risk areas with poor lighting conditions.
Traffic Management
- Implement stricter enforcement of traffic laws, particularly for “failing to yield the right of way.”
- Deploy additional traffic control devices at high-risk intersections.
Public Awareness Campaigns
- Educate drivers about common causes of accidents and safe driving practices, especially targeting peak crash hours and days.
Seasonal Interventions
- Launch targeted road safety campaigns during the May-October period when crash rates are highest.
Technology and Data Utilization
- Use predictive models to identify and address emerging risk factors proactively.
- Integrate real-time data systems for traffic monitoring and incident response.

References

Ahmed, S. S., Hulme, K. F., Fountas, G., & Eker, U. (2023). An ensemble learning approach for injury severity prediction in traffic crashes. Accident Analysis & Prevention, 180, 106896.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.

Dong, B., Ma, X., Chen, F., & Chen, S. (2022). An advanced machine learning approach to predicting injury severity of traffic crashes. Reliability Engineering & System Safety, 217, 108090.

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31, 6638-6648.

Appendix

Column Name	Meaning	Example Value
CRASH_RECORD_ID	Unique identifier for each crash record.	12345678
CRASH_DATE	The date when the crash occurred.	2023-10-15
POSTED_SPEED_LIMIT	The posted speed limit at the crash location (in mph).	35
WEATHER_CONDITION	The weather condition at the time of the crash.	Clear, Rain, Snow, Fog
LIGHTING_CONDITION	The lighting condition during the crash.	Daylight, Dark (Lighted Road), Dark (Unlit)
FIRST_CRASH_TYPE	The type of crash (e.g., rear-end, sideswipe, head-on).	Rear-End, Single Vehicle, Head-On
TRAFFICWAY_TYPE	Type of trafficway (e.g., divided, not divided, one-way).	NOT DIVIDED, ONE-WAY, FOUR WAY
ALIGNMENT	The alignment of the road at the crash location (straight or curved).	Straight, Curved
DAMAGE	The damage level (e.g., property damage, injury, fatal).	Property Damage Only, Injury, Fatal
PRIM_CONTRIBUTORY_CAUSE	The main contributing cause of the crash (e.g., speeding, weather).	Speeding, Distracted Driving, Weather
ROAD_DEFECT	Any defects on the road (e.g., potholes, wet road).	None, Pothole, Wet Road
TRAFFIC_CONTROL_DEVICE	Type of traffic control device present at the crash location.	Traffic Signal, Stop Sign, Yield Sign
DEVICE_CONDITION	Condition of the traffic control device at the time of the crash.	Functional, Non-Functional, Obstructed
INJURIES_TOTAL	Total number of injuries in the crash.	0, 2, 5
CRASH_HOUR	The hour of the day when the crash occurred (24-hour format).	14 (2 PM), 03 (3 AM)
CRASH_DAY_OF_WEEK	The day of the week when the crash occurred.	Monday, Friday, Sunday
CRASH_MONTH	The month when the crash occurred.	1 (January), 12 (December)
LATITUDE	Latitude of the crash location.	41.8781
LONGITUDE	Longitude of the crash location.	-87.6298
VEHICLES_INVOLVED	The number of vehicles involved in the crash.	2, 1, 3
VEHICLE_YEARS	The average year of the vehicles involved in the crash.	2015.5
VEHICLE_TYPE	List of make and model combinations of the vehicles involved in the crash.	“Toyota Camry, Ford Focus”, “Honda Civic”
MANEUVERS	List of maneuvers performed by vehicles involved in the crash.	“Turning Left”, “Changing Lanes”
CRASH_YEAR	The year when the crash occurred.	2023


Traffic Crashes Analysis

Sheila Boateng, Nahimah Suglo, Taiwo Ayeni

2024-12-08