This report details the Natural Language Processing (NLP) analysis of atico.csv, a dataset containing textual descriptions of mobility incidents from emergency room records. The primary objective is to transform this unstructured text into actionable insights by identifying key risk factors and accident patterns involving pedestrians, cyclists, and micromobility users. Our methodology leverages the R ecosystem, primarily the tidyverse for data manipulation and visualization.
2. Data Loading and Preparation
First, we load the necessary R libraries and the dataset. The data is cleaned by selecting the relevant columns, filtering out entries without descriptions, and standardizing the category labels (participant role and collision group).
Code
# STEP 1: LOAD LIBRARIES
# ------------------------------------
# We load the required libraries for the entire analysis.
library(tidyverse)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(forcats)

# STEP 2: LOAD DATA
# ------------------------------------
# Using the specific file path provided.
file_path <- "/Users/nataliochoa/Documents/NLP/atico.csv"

# Using read.csv from base R as requested
accidents_df <- read.csv(file_path, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

# STEP 3: PREPARE AND CLEAN DATA
# ------------------------------------
data_to_plot <- accidents_df %>%
  select(participant_role_detail, CollisionGroup, accident_detail) %>%
  filter(!is.na(accident_detail) & accident_detail != "") %>%
  mutate(
    participant_role_detail = str_replace_all(participant_role_detail, "_", " ") %>%
      str_to_title(),
    # Clean up CollisionGroup for better plotting
    CollisionGroup = if_else(
      is.na(CollisionGroup) | CollisionGroup == "Not collision",
      "No-Collision Incident",
      str_to_title(CollisionGroup)
    )
  )

cat(paste("Successfully loaded and prepared", nrow(data_to_plot), "incident records for analysis.\n"))
Successfully loaded and prepared 1729 incident records for analysis.
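To illustrate what the cleaning step does to individual rows, the following minimal sketch applies the same filter and label standardization to a three-row toy data frame. The role names, collision values, and descriptions are invented for illustration; only the "Not collision" value and the column names come from the code above.

```r
library(dplyr)
library(stringr)

# Hypothetical rows, not taken from atico.csv.
toy <- data.frame(
  participant_role_detail = c("e_scooter_rider", "pedestrian", "cyclist"),
  CollisionGroup = c("car", "Not collision", NA),
  accident_detail = c("Hit by a car at an intersection", "", "Fell on wet tram tracks"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  # Drop rows whose description is missing or empty (the pedestrian row here).
  filter(!is.na(accident_detail) & accident_detail != "") %>%
  mutate(
    # "e_scooter_rider" becomes "E Scooter Rider".
    participant_role_detail = str_to_title(str_replace_all(participant_role_detail, "_", " ")),
    # Missing or "Not collision" values are folded into one label.
    CollisionGroup = if_else(
      is.na(CollisionGroup) | CollisionGroup == "Not collision",
      "No-Collision Incident",
      str_to_title(CollisionGroup)
    )
  )

print(cleaned)
```

Two rows survive: the e-scooter rider with collision group "Car", and the cyclist with "No-Collision Incident".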
3. Feature Engineering: Accident Severity Score
To quantify the severity of each incident, we developed a proxy score based on keywords found in the descriptions. Each keyword carries a weight, with higher weights for words associated with severe outcomes (e.g., “fracture”, “unconscious”), and an incident's score is the sum of the weights of all keyword occurrences in its description. This allows us to compare severity across different incident types.
Code
# Keyword dictionary with associated weights
severity_keywords <- c(
  "unconscious" = 10, "consciousness" = 10, "fracture" = 8, "head" = 7,
  "blacked out" = 7, "fell hard" = 6, "flung" = 5, "thrown" = 5, "flew" = 5,
  "rolled" = 4, "hit" = 3, "bruise" = 2, "scraped" = 1, "scratched" = 1
)

# Function to calculate the severity score
calculate_severity_r <- function(text) {
  text_lower <- tolower(paste(text, collapse = " "))
  if (nchar(text_lower) == 0) return(0)
  score <- sum(sapply(names(severity_keywords), function(keyword) {
    str_count(text_lower, fixed(keyword)) * severity_keywords[[keyword]]
  }))
  return(score)
}

# Apply the function to create the new 'severity_score' column
data_to_plot <- data_to_plot %>%
  rowwise() %>%
  mutate(severity_score = calculate_severity_r(accident_detail)) %>%
  ungroup()

cat("Severity scores calculated. Displaying top 5 most severe incidents by score:\n")
Severity scores calculated. Displaying top 5 most severe incidents by score:
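To make the scoring rule concrete, the sketch below applies the same keyword weights to a single made-up description, using base R only so that it runs without extra packages. The example sentence and the helper names (`count_fixed`, `count_whole_word`, `score_text`) are illustrative assumptions and not part of the original analysis. The sketch also shows a caveat of literal substring matching, which counts "hit" inside "white" and "head" inside "ahead"; the whole-word variant is one possible refinement, not the method used for the results in this report.

```r
# Same weights as the keyword dictionary above.
severity_keywords <- c("unconscious" = 10, "consciousness" = 10, "fracture" = 8,
                       "head" = 7, "blacked out" = 7, "fell hard" = 6,
                       "flung" = 5, "thrown" = 5, "flew" = 5, "rolled" = 4,
                       "hit" = 3, "bruise" = 2, "scraped" = 1, "scratched" = 1)

# Hypothetical helper: count literal substring occurrences of a keyword
# (same matching behaviour as a fixed-string count).
count_fixed <- function(text, keyword) {
  hits <- gregexpr(keyword, text, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Hypothetical helper: count occurrences that stand as whole words only.
count_whole_word <- function(text, keyword) {
  hits <- gregexpr(paste0("\\b", keyword, "\\b"), text, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Sum of (occurrences x weight) over all keywords, using the given counter.
score_text <- function(text, counter) {
  text_lower <- tolower(text)
  sum(vapply(names(severity_keywords),
             function(k) counter(text_lower, k) * severity_keywords[[k]],
             numeric(1)))
}

# Made-up description, not taken from the dataset.
example <- "A white van was ahead of me; I was hit and thrown, and got a fracture."

substring_score  <- score_text(example, count_fixed)       # fracture + head + thrown + 2 x hit
whole_word_score <- score_text(example, count_whole_word)  # fracture + thrown + hit

cat("Substring score:", substring_score, "\n")
cat("Whole-word score:", whole_word_score, "\n")
```

For this sentence the substring rule yields 26 (8 + 7 + 5 + 2 × 3), while whole-word matching yields 16 (8 + 5 + 3). The gap comes entirely from "white" and "ahead", which suggests that scores for descriptions containing such words are worth a manual check.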
The following visualizations highlight the key patterns discovered in the data.
4.1. Most Common Accident Scenarios
Collisions with cars and non-collision incidents (e.g., falls) are the two most frequent categories, underscoring where safety efforts should be focused.
Code
ggplot(data_to_plot, aes(y = fct_rev(fct_infreq(CollisionGroup)), fill = CollisionGroup)) +
  geom_bar(show.legend = FALSE) +
  geom_text(stat = 'count', aes(label = after_stat(count)), hjust = -0.3, size = 3.5) +
  labs(
    title = "Most Common Accident Scenarios",
    x = "Number of Incidents",
    y = "Primary Accident Cause"
  ) +
  theme_minimal(base_size = 12)
Figure 1: This bar chart displays the total count of incidents categorized by their primary cause, clearly showing the dominance of vehicle collisions and falls.
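The counts behind a chart of this kind can also be tabulated directly, which makes it easier to quote exact numbers and shares. The sketch below does this on a small invented data frame; `toy_incidents` and its values are assumptions for illustration, and in the real analysis the same pipeline would be applied to `data_to_plot`.

```r
library(dplyr)

# Hypothetical incidents standing in for data_to_plot.
toy_incidents <- data.frame(
  CollisionGroup = c("Car", "Car", "No-Collision Incident", "Car",
                     "Bicycle", "No-Collision Incident"),
  stringsAsFactors = FALSE
)

scenario_counts <- toy_incidents %>%
  count(CollisionGroup, sort = TRUE) %>%          # most frequent category first
  mutate(percent = round(100 * n / sum(n), 1))    # share of all incidents

print(scenario_counts)
```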
4.2. Severity by Participant Role
While all participants are at risk, pedestrians consistently experience incidents with the highest average severity scores, followed closely by cyclists.
Code
ggplot(
  data_to_plot,
  aes(
    x = severity_score,
    y = reorder(participant_role_detail, severity_score, FUN = median),
    fill = participant_role_detail
  )
) +
  geom_boxplot(show.legend = FALSE, alpha = 0.8) +
  labs(
    title = "Accident Severity Score by Participant Role",
    x = "Severity Score (Keyword-based)",
    y = "Participant Role"
  ) +
  theme_minimal(base_size = 12)
Figure 2: This box plot visualizes the distribution of severity scores for each participant role, with the median represented by the central line.
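Because the box plot orders roles by median while the text speaks of average severity, it can help to compute both statistics per role. The sketch below does so on invented scores; `toy_scores` and its values are assumptions chosen to show that the mean and the median can rank roles differently, and they say nothing about the actual data.

```r
library(dplyr)

# Hypothetical scores standing in for data_to_plot.
toy_scores <- data.frame(
  participant_role_detail = c("Pedestrian", "Pedestrian", "Pedestrian",
                              "Cyclist", "Cyclist", "Cyclist"),
  severity_score = c(0, 3, 18, 3, 5, 7),
  stringsAsFactors = FALSE
)

role_summary <- toy_scores %>%
  group_by(participant_role_detail) %>%
  summarise(
    n            = n(),
    mean_score   = mean(severity_score),
    median_score = median(severity_score),
    .groups      = "drop"
  ) %>%
  arrange(desc(median_score))   # same ordering criterion as the box plot

print(role_summary)
```

In this toy example the pedestrian group has the higher mean (7 versus 5) but the lower median (3 versus 5), because a single high score pulls the mean up. Reporting both figures next to the plot avoids ambiguity about which one a statement refers to.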
4.3. Key Risk Factors from Textual Data
A word cloud of the most frequent terms in the incident descriptions highlights the core components of these events: intersections, sidewalks, left turns, and falls are dominant themes.
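A word cloud of this kind is typically built from per-word frequencies. The following is a minimal sketch of one way to do that with tidytext and wordcloud, assuming single-word tokens and the generic English stop-word list shipped with tidytext; the three descriptions are invented, and the report's own tokenization and plotting settings may differ.

```r
library(dplyr)
library(tidytext)
library(wordcloud)
library(RColorBrewer)

# Hypothetical descriptions, not taken from atico.csv.
toy_text <- data.frame(
  accident_detail = c(
    "Car turned left at the intersection and hit the cyclist",
    "Pedestrian fell on the sidewalk near the intersection",
    "Scooter rider fell after the car turned left"
  ),
  stringsAsFactors = FALSE
)

word_counts <- toy_text %>%
  unnest_tokens(word, accident_detail) %>%   # one lowercase word per row
  anti_join(stop_words, by = "word") %>%     # drop common English stop words
  count(word, sort = TRUE)                   # frequency per remaining word

# Draw the cloud; more frequent words are plotted larger and nearer the centre.
wordcloud(
  words        = word_counts$word,
  freq         = word_counts$n,
  min.freq     = 1,
  max.words    = 100,
  random.order = FALSE,
  colors       = brewer.pal(8, "Dark2")
)
```

Generic stop-word lists can also remove words that matter in a mobility context, so it is worth inspecting `word_counts` before interpreting the cloud. Multi-word themes such as "left turns" would additionally need bigram tokens rather than single words.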