Based on a dataset that you find interesting, propose a complete data mining project. The structure of the response must align with the typical phases of the data mining project lifecycle. There is no need to carry out the tasks of each phase.
It is expected that the response will address the following questions in a structured manner (using the CRISP-DM methodology):
For each phase, indicate the phase’s objective and the expected outcome. Use examples to illustrate what and how the tasks could be performed. If there are any unique characteristics that differentiate the lifecycle of a data mining project from other types of projects, highlight them.
This project focuses on developing an effective system to enhance safety in industrial and urban environments, where fires can occur at any time. Early detection of fire and smoke is crucial for protecting lives and property.
BUSINESS UNDERSTANDING
In this phase, the goal is to identify the business needs and expectations regarding the fire and smoke detection system. It is essential to understand the current risks, the areas most prone to fires, and how this system can enhance safety and emergency response. By the end of this stage, we should be able to answer the key questions and determine the metrics that will define success or failure.
We will conduct interviews with key stakeholders, such as corporate safety officers and firefighters, analyze statistics on fire and smoke incidents in different environments (industrial and urban), and establish success criteria, such as a target percentage of correct detections and response time.
This system can benefit various industries seeking to protect customers, employees, or civilians, such as occupational safety companies, chemical industries handling flammable materials, and emergency response services for firefighters, among others.
Question: What does the business need?
Answer: The business needs an efficient fire and smoke detection system that enhances safety and emergency response across various industries.
DATA UNDERSTANDING
In this phase, the objective is to explore the available dataset to understand its quality, structure, and relevance to the project. The goal is to answer questions such as: What data do we have? Are they sufficient and representative? Are there quality issues? The expected outcome is a descriptive report on the dataset, including statistical summaries and quality analysis.
This phase also involves examining the number of images per class (fire and smoke) to ensure a balanced dataset, verifying annotation quality through sampling to ensure bounding boxes are correctly defined, and identifying and documenting issues such as low-quality, blurry, misclassified, or duplicate images.
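To illustrate, the class-balance check could look like the following R sketch, assuming a hypothetical annotation file labels.csv with one row per image and a class column; the stand-in data below are random and only serve the example.
# Hypothetical annotation file: one row per image with a 'class' column (0 = fire, 1 = smoke)
# labels <- read.csv("labels.csv")
set.seed(7)
labels <- data.frame(class = sample(0:1, 35000, replace = TRUE))  # stand-in data
table(labels$class)              # absolute number of images per class
prop.table(table(labels$class))  # relative class balance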
According to the dataset documentation, there are two classes:
0: Fire – Images containing visible flames or areas where a fire is clearly present.
1: Smoke – Images with visible smoke, either in the early stages of fire development or due to environmental factors.
Dataset Composition:
The dataset includes over 35,000 labeled images for fire and smoke detection.
The images vary in lighting conditions, resolutions, and environmental contexts to ensure the model generalizes well across different real-world scenarios.
Training Data: Images used for model training, with balanced examples of fire and smoke.
Validation Data: Used to fine-tune model hyperparameters and validate performance.
Test Data: Held-out data for final model evaluation, containing unseen images of fire and smoke.
Questions:
Question: What data do we have/need? Are they clean?
Answer: We have over 35,000 labeled images of fire and smoke, but we need to verify annotation quality and ensure proper class balance.
DATA PREPARATION
Data preparation involves cleaning, transforming, and organizing the dataset to ensure it is ready for modeling. This phase is crucial to minimizing errors in modeling and maximizing the quality of the results. The goal is a structured and clean dataset that can be used to train detection models.
During this phase, key tasks include normalizing images to a standard size and format (JPEG), correcting annotation errors such as incorrect bounding box coordinates or misassigned labels, and splitting the dataset into training, validation, and test sets while maintaining class distribution.
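As an illustration of the splitting task, the following R sketch performs an approximate 70/15/15 stratified split on a hypothetical images data frame; the file names, column names, and proportions are all assumptions for the example.
set.seed(42)
# Hypothetical image inventory with one class label per file
images <- data.frame(
  file  = sprintf("img_%05d.jpg", 1:1000),
  class = sample(c("fire", "smoke"), 1000, replace = TRUE)
)
probs <- c(train = 0.70, validation = 0.15, test = 0.15)
images$split <- NA_character_
# Sample the split label independently within each class so that the
# class distribution is (approximately) preserved in every split
for (cl in unique(images$class)) {
  idx <- which(images$class == cl)
  images$split[idx] <- sample(names(probs), length(idx), replace = TRUE, prob = probs)
}
table(images$class, images$split)  # check class balance per split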
“Once data sources are identified, we must proceed with preparing them so that they can be used with the methods or tools that will build the desired model. This phase, although seemingly simple, along with data selection, consumes 70% (or more!) of the effort in newly implemented data mining projects.” Page 20 - PID_00284574
Question: How do we organize the data for modeling?
Answer: We organize the data by normalizing images, correcting annotations, and splitting the dataset into training, validation, and test sets.
MODELING
In this phase, the goal is to select and apply the best techniques to solve the problem of detecting fire and smoke. The objective is to create a model that generalizes well to new data and meets the defined success criteria. By the end of this phase, we will have trained and validated object detection models ready for evaluation.
Key tasks include implementing various object detection algorithms, such as YOLO and Faster R-CNN, while tuning their hyperparameters to optimize performance; performing cross-validation to assess the model's robustness across different data subsets; and analyzing performance metrics such as precision and recall to identify which models perform best for each class.
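A minimal sketch of how cross-validation folds could be assigned; the image count and fold number are illustrative assumptions.
set.seed(1)
n <- 35000                                # hypothetical number of labeled images
k <- 5                                    # number of cross-validation folds
fold <- sample(rep(1:k, length.out = n))  # shuffled, near-equal fold assignment
table(fold)                               # each fold serves once as the validation set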
“The task of these data mining projects is not exactly the same as in the previous point. Here, it is more common to start from a more informed situation, knowing that pre-defined groups already exist.” Page 13 - PID_00284574
Question: What modeling techniques should we apply?
Answer: We will apply object detection algorithms such as YOLO and Faster R-CNN, fine-tuning their hyperparameters to optimize performance.
EVALUATION
The goal is to determine whether the model meets business requirements and expectations. An evaluation report will be generated, including a performance analysis and recommendations.
At this stage, we seek to assess whether the model fulfills the pre-established requirements and expectations, evaluating its usefulness both from a technical and business perspective. Validation techniques will be employed to measure model performance across different datasets, providing a solid evaluation. Additionally, evaluation metrics such as precision, recall, and F1-score will be analyzed to gain deeper insight into model behavior.
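As a minimal sketch of how these metrics would be computed, with hypothetical detection counts for one class:
# Hypothetical counts: true positives, false positives, false negatives
tp <- 420; fp <- 35; fn <- 60
precision <- tp / (tp + fp)  # share of detections that are correct
recall <- tp / (tp + fn)     # share of real events that are detected
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of both
round(c(precision = precision, recall = recall, F1 = f1), 3)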
It is also essential to compare the model’s results with other available alternatives. This comparison is crucial to determine the model’s effectiveness and whether other approaches should be explored. This process is not linear; instead, it involves continuous review and refinement to ensure that the model remains relevant and effective as conditions change and new challenges arise.
“This process is not linear; rather, it is iterative and continuous: new changes in the situation may render our knowledge outdated, requiring us to extract new insights.” Page 12 - PID_00284574
Question: Which model best meets the business objectives?
Answer: We will evaluate trained models using cross-validation and performance metrics to determine which best meets business objectives.
DEPLOYMENT
The final phase of the project lifecycle aims to integrate the model into an operational system, allowing end users to access its results effectively. The fire and smoke detection system is deployed, ensuring that it functions properly in a real-world environment and meets the established requirements.
Additionally, continuous monitoring of the model is essential to ensure it continues to operate optimally. This includes making periodic adjustments based on results and changing conditions, guaranteeing that the system maintains its effectiveness and continues to achieve the project objectives. (This involves continuously reapplying Phase 5: Evaluation.)
“Once the objective is defined, and when we have linked it to the project’s main task, identifying which models are most relevant and what methods and tools are needed, we must proceed to find the raw material: the data.” Page 19 - PID_00284574
Question: How do stakeholders access the results?
Answer: Stakeholders will access the results through an integrated system that enables real-time visualization and analysis of detections.
If there is any characteristic that differentiates the lifecycle of a data mining project from other projects, indicate it.
The lifecycle of a data mining project is iterative and adaptable. Unlike other data analysis projects that may follow a more linear approach, data mining requires continuous revisions and the incorporation of new data.
This ensures that models remain relevant and effective in real-world situations, which is critical for safety applications such as fire and smoke detection.
This approach reflects the need to “define the data mining task” and understand that achieving objectives may require a combined effort, where data is first grouped, then classified, and finally, a predictive model is extracted.
This iterative process is also closely linked to “model evaluation and interpretation,” where continuous validation and model adaptation to new circumstances are essential for project success. Additionally, ensuring that the extracted knowledge is valid and applicable in practice is a key principle in data mining, reinforcing the need for effective integration into the organization’s information system.
This iterative cycle not only enables continuous model improvement but also aligns with the goal of “explaining” behaviors, allowing analysts to understand and adjust the reasoning behind the results. This is particularly crucial in critical environments like fire and smoke detection.
Using the dataset from the PEC example, perform the preliminary tasks for generating a data mining model, as explained in the modules “The Data Mining Process” and “Data Preprocessing and Feature Management”.
You may use the PEC example as a reference, but you should change the approach and analyze the data based on different dimensions. Thus, you cannot use the same combination of variables as in the example: "FATALS", "DRUNK_DR", "VE_TOTAL", "VE_FORMS", "PVH_INVL", "PEDS", "PERSONS", "PERMVIT", "PERNOTMVIT". You must analyze any other combination, which may include some of these variables along with new ones.
Optionally, and as an added value, you may incorporate data from other years for temporal comparisons (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/) or include additional factors for analysis, such as drug use in accidents (https://static.nhtsa.gov/nhtsa/downloads/FARS/2020/National/FARS2020NationalCSV.zip).
A dataset from the National Highway Traffic Safety Administration (NHTSA) for the year 2020 has been selected. This dataset records accidents with at least one fatality. The objective is to understand what factors contribute to an accident being classified as severe and what defines this severity.
Before starting the analysis, we install the necessary libraries, using an if statement to check whether each one is already installed; this avoids redundant installations and conflicts in our code.
if (!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
## Loading required package: ggplot2
if(!require('Rmisc')) install.packages('Rmisc'); library('Rmisc')
## Loading required package: Rmisc
## Loading required package: lattice
## Loading required package: plyr
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if(!require('xfun')) install.packages('xfun'); library('xfun')
## Loading required package: xfun
##
## Attaching package: 'xfun'
## The following object is masked from 'package:base':
##
## attr
if(!require('factoextra')) install.packages('factoextra', dependencies = TRUE)
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if(!require('mice')) install.packages('mice', dependencies = TRUE)
## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
We load the dataset, keeping the same naming conventions as in the guided example, and examine its structure.
path <- 'accident.CSV'
accidentData <- read.csv(path, row.names = NULL)
str(accidentData)
## 'data.frame': 35766 obs. of 81 variables:
## $ STATE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STATENAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ ST_CASE : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ VE_TOTAL : int 1 4 2 1 1 2 1 2 2 2 ...
## $ VE_FORMS : int 1 4 2 1 1 2 1 2 2 2 ...
## $ PVH_INVL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PEDS : int 0 0 0 0 0 0 1 0 0 0 ...
## $ PERSONS : int 4 6 2 5 1 3 1 2 4 3 ...
## $ PERMVIT : int 4 6 2 5 1 3 1 2 4 3 ...
## $ PERNOTMVIT : int 0 0 0 0 0 0 1 0 0 0 ...
## $ COUNTY : int 51 73 117 15 37 103 73 25 45 95 ...
## $ COUNTYNAME : chr "ELMORE (51)" "JEFFERSON (73)" "SHELBY (117)" "CALHOUN (15)" ...
## $ CITY : int 0 350 0 0 0 0 330 0 0 1500 ...
## $ CITYNAME : chr "NOT APPLICABLE" "BIRMINGHAM" "NOT APPLICABLE" "NOT APPLICABLE" ...
## $ DAY : int 1 2 2 3 4 4 7 8 9 10 ...
## $ DAYNAME : int 1 2 2 3 4 4 7 8 9 10 ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MONTHNAME : chr "January" "January" "January" "January" ...
## $ YEAR : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ DAY_WEEK : int 4 5 5 6 7 7 3 4 5 6 ...
## $ DAY_WEEKNAME: chr "Wednesday" "Thursday" "Thursday" "Friday" ...
## $ HOUR : int 2 17 14 15 0 16 19 7 20 10 ...
## $ HOURNAME : chr "2:00am-2:59am" "5:00pm-5:59pm" "2:00pm-2:59pm" "3:00pm-3:59pm" ...
## $ MINUTE : int 58 18 55 20 45 55 23 15 0 2 ...
## $ MINUTENAME : chr "58" "18" "55" "20" ...
## $ NHS : int 0 0 0 0 0 0 0 0 0 1 ...
## $ NHSNAME : chr "This section IS NOT on the NHS" "This section IS NOT on the NHS" "This section IS NOT on the NHS" "This section IS NOT on the NHS" ...
## $ ROUTE : int 4 6 3 4 4 3 4 4 4 2 ...
## $ ROUTENAME : chr "County Road" "Local Street - Municipality" "State Highway" "County Road" ...
## $ TWAY_ID : chr "cr-4" "martin luther king jr dr" "sr-76" "CR-ALEXANDRIA WELLINGTON RD" ...
## $ TWAY_ID2 : chr "" "" "us-280" "" ...
## $ RUR_URB : int 1 2 1 1 1 1 2 1 1 1 ...
## $ RUR_URBNAME : chr "Rural" "Urban" "Rural" "Rural" ...
## $ FUNC_SYS : int 5 4 4 7 5 4 4 5 5 3 ...
## $ FUNC_SYSNAME: chr "Major Collector" "Minor Arterial" "Minor Arterial" "Local" ...
## $ RD_OWNER : int 2 4 1 2 2 1 4 2 2 1 ...
## $ RD_OWNERNAME: chr "County Highway Agency" "City or Municipal Highway Agency" "State Highway Agency" "County Highway Agency" ...
## $ MILEPT : int 0 0 49 0 0 390 0 0 0 3019 ...
## $ MILEPTNAME : chr "None" "None" "49" "None" ...
## $ LATITUDE : num 32.4 33.5 33.3 33.8 32.8 ...
## $ LATITUDENAME: chr "32.43313333" "33.48465833" "33.29994167" "33.79507222" ...
## $ LONGITUD : num -86.1 -86.8 -86.4 -85.9 -86.1 ...
## $ LONGITUDNAME: chr "-86.09485" "-86.83954444" "-86.36964167" "-85.88348611" ...
## $ SP_JUR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SP_JURNAME : chr "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" ...
## $ HARM_EV : int 42 12 34 42 42 12 8 12 12 12 ...
## $ HARM_EVNAME : chr "Tree (Standing Only)" "Motor Vehicle In-Transport" "Ditch" "Tree (Standing Only)" ...
## $ MAN_COLL : int 0 6 0 0 0 2 0 1 1 2 ...
## $ MAN_COLLNAME: chr "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "Angle" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" ...
## $ RELJCT1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RELJCT1NAME : chr "No" "No" "No" "No" ...
## $ RELJCT2 : int 1 1 3 1 1 1 3 1 8 1 ...
## $ RELJCT2NAME : chr "Non-Junction" "Non-Junction" "Intersection-Related" "Non-Junction" ...
## $ TYP_INT : int 1 1 3 1 1 1 2 1 1 1 ...
## $ TYP_INTNAME : chr "Not an Intersection" "Not an Intersection" "T-Intersection" "Not an Intersection" ...
## $ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WRK_ZONENAME: chr "None" "None" "None" "None" ...
## $ REL_ROAD : int 4 1 4 4 4 1 1 1 1 1 ...
## $ REL_ROADNAME: chr "On Roadside" "On Roadway" "On Roadside" "On Roadside" ...
## $ LGT_COND : int 2 3 1 1 2 2 3 1 2 1 ...
## $ LGT_CONDNAME: chr "Dark - Not Lighted" "Dark - Lighted" "Daylight" "Daylight" ...
## $ WEATHER : int 1 2 2 10 2 1 1 1 10 10 ...
## $ WEATHERNAME : chr "Clear" "Rain" "Rain" "Cloudy" ...
## $ SCH_BUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SCH_BUSNAME : chr "No" "No" "No" "No" ...
## $ RAIL : chr "0000000" "0000000" "0000000" "0000000" ...
## $ RAILNAME : chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ NOT_HOUR : int 99 17 14 99 0 17 19 7 20 10 ...
## $ NOT_HOURNAME: chr "Unknown" "5:00pm-5:59pm" "2:00pm-2:59pm" "Unknown" ...
## $ NOT_MIN : int 99 18 58 99 45 0 23 21 0 3 ...
## $ NOT_MINNAME : chr "Unknown" "18" "58" "Unknown" ...
## $ ARR_HOUR : int 3 17 15 99 0 17 19 7 20 10 ...
## $ ARR_HOURNAME: chr "3:00am-3:59am" "5:00pm-5:59pm" "3:00pm-3:59pm" "Unknown EMS Scene Arrival Hour" ...
## $ ARR_MIN : int 10 26 15 99 55 19 29 28 10 7 ...
## $ ARR_MINNAME : chr "10" "26" "15" "Unknown EMS Scene Arrival Minutes" ...
## $ HOSP_HR : int 99 99 99 99 88 18 88 88 99 10 ...
## $ HOSP_HRNAME : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ HOSP_MN : int 99 99 99 99 88 51 88 88 99 29 ...
## $ HOSP_MNNAME : chr "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" ...
## $ FATALS : int 3 1 1 1 1 1 1 1 1 1 ...
## $ DRUNK_DR : int 1 0 0 0 0 0 0 0 0 0 ...
Although we already knew this from the guided example, we obtain the number of observations and variables.
num_observaciones <- nrow(accidentData)
num_variables <- ncol(accidentData)
cat("NĂşmero de observaciones:", num_observaciones, "\n")
## NĂşmero de observaciones: 35766
cat("NĂşmero de variables:", num_variables, "\n")
## NĂşmero de variables: 81
We continue following the initial guidelines from the guided example, as the steps remain the same. Next, we review the variables and validate them against the documentation to prevent errors before starting our analysis. The variables are arranged logically by dimension so that they can be interpreted together; the dimensions to study are listed below, and the sketch after the list makes the grouping explicit in code.
FACTS TO STUDY
GEOGRAPHIC DIMENSION
TEMPORAL DIMENSION
ACCIDENT CONDITIONS DIMENSION
METEOROLOGICAL DIMENSION
OTHER FACTORS
EMERGENCY SERVICE DIMENSION
ACCIDENT-RELATED FACTORS DIMENSION
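A minimal sketch, reflecting our own reading of the FARS documentation, of how these dimensions can be made explicit as named groups of dataset variables:
# Group the variables to study into the dimensions listed above
dimensions <- list(
  geographic = c("STATE", "STATENAME", "COUNTY", "COUNTYNAME", "CITY", "CITYNAME",
                 "LATITUDE", "LONGITUD"),
  temporal = c("YEAR", "MONTH", "DAY", "DAY_WEEK", "HOUR", "MINUTE"),
  accident_conditions = c("ROUTE", "RUR_URB", "FUNC_SYS", "REL_ROAD", "TYP_INT", "LGT_COND"),
  meteorological = c("WEATHER", "WEATHERNAME"),
  other_factors = c("SCH_BUS", "WRK_ZONE", "SP_JUR"),
  emergency_service = c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN"),
  accident_factors = c("FATALS", "DRUNK_DR", "PERSONS", "PEDS", "VE_TOTAL")
)
# Check that every grouped variable actually exists in the dataset
all(unlist(dimensions) %in% names(accidentData))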
Finally, before starting data preprocessing, we review the basic statistics of our dataset.
summary(accidentData)
## STATE STATENAME ST_CASE VE_TOTAL
## Min. : 1.00 Length:35766 Min. : 10001 Min. : 1.00
## 1st Qu.:12.00 Class :character 1st Qu.:122078 1st Qu.: 1.00
## Median :26.00 Mode :character Median :260917 Median : 1.00
## Mean :27.16 Mean :272387 Mean : 1.56
## 3rd Qu.:42.00 3rd Qu.:420477 3rd Qu.: 2.00
## Max. :56.00 Max. :560115 Max. :15.00
## VE_FORMS PVH_INVL PEDS PERSONS
## Min. : 1.000 Min. : 0.00000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.: 1.000
## Median : 1.000 Median : 0.00000 Median :0.0000 Median : 2.000
## Mean : 1.517 Mean : 0.04269 Mean :0.2285 Mean : 2.173
## 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.:0.0000 3rd Qu.: 3.000
## Max. :15.000 Max. :10.00000 Max. :8.0000 Max. :61.000
## PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## Min. : 0.000 Min. :0.0000 Min. : 1.00 Length:35766
## 1st Qu.: 1.000 1st Qu.:0.0000 1st Qu.: 31.00 Class :character
## Median : 2.000 Median :0.0000 Median : 71.00 Mode :character
## Mean : 2.163 Mean :0.2387 Mean : 93.06
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:117.00
## Max. :61.000 Max. :9.0000 Max. :999.00
## CITY CITYNAME DAY DAYNAME
## Min. : 0 Length:35766 Min. : 1.00 Min. : 1.00
## 1st Qu.: 0 Class :character 1st Qu.: 8.00 1st Qu.: 8.00
## Median : 120 Mode :character Median :16.00 Median :16.00
## Mean :1436 Mean :15.71 Mean :15.71
## 3rd Qu.:2080 3rd Qu.:23.00 3rd Qu.:23.00
## Max. :9999 Max. :31.00 Max. :31.00
## MONTH MONTHNAME YEAR DAY_WEEK
## Min. : 1.000 Length:35766 Min. :2020 Min. :1.000
## 1st Qu.: 4.000 Class :character 1st Qu.:2020 1st Qu.:2.000
## Median : 7.000 Mode :character Median :2020 Median :4.000
## Mean : 6.898 Mean :2020 Mean :4.114
## 3rd Qu.:10.000 3rd Qu.:2020 3rd Qu.:6.000
## Max. :12.000 Max. :2020 Max. :7.000
## DAY_WEEKNAME HOUR HOURNAME MINUTE
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.: 7.00 Class :character 1st Qu.:14.00
## Mode :character Median :15.00 Mode :character Median :30.00
## Mean :13.94 Mean :29.24
## 3rd Qu.:19.00 3rd Qu.:45.00
## Max. :99.00 Max. :99.00
## MINUTENAME NHS NHSNAME ROUTE
## Length:35766 Min. :0.0000 Length:35766 Min. :1.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.:2.000
## Mode :character Median :0.0000 Mode :character Median :3.000
## Mean :0.5877 Mean :3.901
## 3rd Qu.:1.0000 3rd Qu.:6.000
## Max. :9.0000 Max. :9.000
## ROUTENAME TWAY_ID TWAY_ID2 RUR_URB
## Length:35766 Length:35766 Length:35766 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Mode :character Median :2.000
## Mean :1.662
## 3rd Qu.:2.000
## Max. :9.000
## RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## Length:35766 Min. : 1.000 Length:35766 Min. : 1.00
## Class :character 1st Qu.: 3.000 Class :character 1st Qu.: 1.00
## Mode :character Median : 4.000 Mode :character Median : 1.00
## Mean : 6.038 Mean :19.96
## 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :99.000 Max. :99.00
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE
## Length:35766 Min. : 0.0 Length:35766 Min. : 19.09
## Class :character 1st Qu.: 2.0 Class :character 1st Qu.: 32.99
## Mode :character Median : 80.0 Mode :character Median : 36.17
## Mean :19990.7 Mean : 36.90
## 3rd Qu.: 955.5 3rd Qu.: 40.45
## Max. :99999.0 Max. :100.00
## LATITUDENAME LONGITUD LONGITUDNAME SP_JUR
## Length:35766 Min. :-165.30 Length:35766 Min. :0.00000
## Class :character 1st Qu.: -97.90 Class :character 1st Qu.:0.00000
## Mode :character Median : -87.81 Mode :character Median :0.00000
## Mean : -84.59 Mean :0.04029
## 3rd Qu.: -81.52 3rd Qu.:0.00000
## Max. :1000.00 Max. :9.00000
## SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## Length:35766 Min. : 1.00 Length:35766 Min. : 0.000
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 0.000
## Mode :character Median :12.00 Mode :character Median : 0.000
## Mean :18.31 Mean : 1.929
## 3rd Qu.:30.00 3rd Qu.: 2.000
## Max. :99.00 Max. :99.000
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2
## Length:35766 Min. :0.00000 Length:35766 Min. : 1.000
## Class :character 1st Qu.:0.00000 Class :character 1st Qu.: 1.000
## Mode :character Median :0.00000 Mode :character Median : 1.000
## Mean :0.07283 Mean : 2.368
## 3rd Qu.:0.00000 3rd Qu.: 2.000
## Max. :9.00000 Max. :99.000
## RELJCT2NAME TYP_INT TYP_INTNAME WRK_ZONE
## Length:35766 Min. : 1.000 Length:35766 Min. :0.00000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:0.00000
## Mode :character Median : 1.000 Mode :character Median :0.00000
## Mean : 1.764 Mean :0.04748
## 3rd Qu.: 1.000 3rd Qu.:0.00000
## Max. :99.000 Max. :4.00000
## WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## Length:35766 Min. : 1.000 Length:35766 Min. :1.000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:1.000
## Mode :character Median : 1.000 Mode :character Median :2.000
## Mean : 2.557 Mean :1.961
## 3rd Qu.: 4.000 3rd Qu.:3.000
## Max. :99.000 Max. :9.000
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS
## Length:35766 Min. : 1.000 Length:35766 Min. :0.000000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:0.000000
## Mode :character Median : 1.000 Mode :character Median :0.000000
## Mean : 9.725 Mean :0.001426
## 3rd Qu.: 2.000 3rd Qu.:0.000000
## Max. :99.000 Max. :1.000000
## SCH_BUSNAME RAIL RAILNAME NOT_HOUR
## Length:35766 Length:35766 Length:35766 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:16.00
## Mode :character Mode :character Mode :character Median :99.00
## Mean :61.39
## 3rd Qu.:99.00
## Max. :99.00
## NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.:34.00 Class :character 1st Qu.:16.00
## Mode :character Median :98.00 Mode :character Median :99.00
## Mean :68.46 Mean :61.88
## 3rd Qu.:99.00 3rd Qu.:99.00
## Max. :99.00 Max. :99.00
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.:34.00 Class :character 1st Qu.:88.00
## Mode :character Median :98.00 Mode :character Median :88.00
## Mean :68.74 Mean :77.59
## 3rd Qu.:99.00 3rd Qu.:99.00
## Max. :99.00 Max. :99.00
## HOSP_HRNAME HOSP_MN HOSP_MNNAME FATALS
## Length:35766 Min. : 0.00 Length:35766 Min. :1.000
## Class :character 1st Qu.:88.00 Class :character 1st Qu.:1.000
## Mode :character Median :88.00 Mode :character Median :1.000
## Mean :80.76 Mean :1.085
## 3rd Qu.:99.00 3rd Qu.:1.000
## Max. :99.00 Max. :8.000
## DRUNK_DR
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2664
## 3rd Qu.:1.0000
## Max. :4.0000
In this phase, we transform our dataset into a new one by removing or correcting any information that we consider erroneous or of little value, so that we can begin our analysis.
Logically, we first check for missing values in both forms: NA values and empty strings.
missing_values <- colSums(is.na(accidentData)) + colSums(accidentData == "")
print(missing_values)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL
## 0 0 0 0 0 0
## PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## 0 0 0 0 0 0
## CITY CITYNAME DAY DAYNAME MONTH MONTHNAME
## 0 0 0 0 0 0
## YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE
## 0 0 0 0 0 0
## MINUTENAME NHS NHSNAME ROUTE ROUTENAME TWAY_ID
## 0 0 0 0 0 0
## TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## 26997 0 0 0 0 0
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE LATITUDENAME LONGITUD
## 0 0 0 0 0 0
## LONGITUDNAME SP_JUR SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## 0 0 0 0 0 0
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT
## 0 0 0 0 0 0
## TYP_INTNAME WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## 0 0 0 0 0 0
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL
## 0 0 0 0 0 0
## RAILNAME NOT_HOUR NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## 0 0 0 0 0 0
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR HOSP_HRNAME HOSP_MN
## 0 0 0 0 0 0
## HOSP_MNNAME FATALS DRUNK_DR
## 0 0 0
We observe that the variable TWAY_ID2 has 26,997 missing values (empty strings), so we will analyze this variable.
str(accidentData$TWAY_ID2)
## chr [1:35766] "" "" "us-280" "" "" "" "17TH ST" "" "" "" "cr-400" ...
missing_tway_id2 <- sum(is.na(accidentData$TWAY_ID2))
cat("Number of missing values in TWAY_ID2:", missing_tway_id2, "\n")
## Number of missing values in TWAY_ID2: 0
We understand that data for these roads is unavailable for some reason. However, in the emergency analysis presented later, it is essential to consider the road as an additional variable when interpreting the results, so we must account for the transformation of this variable. To address the issue, we fill the empty values with "notspecify" and verify that the procedure is applied correctly.
accidentData$TWAY_ID2 <- ifelse(accidentData$TWAY_ID2 == "", "notspecify", accidentData$TWAY_ID2)
missing_values <- colSums(is.na(accidentData)) + colSums(accidentData == "")
print(missing_values)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL
## 0 0 0 0 0 0
## PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## 0 0 0 0 0 0
## CITY CITYNAME DAY DAYNAME MONTH MONTHNAME
## 0 0 0 0 0 0
## YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE
## 0 0 0 0 0 0
## MINUTENAME NHS NHSNAME ROUTE ROUTENAME TWAY_ID
## 0 0 0 0 0 0
## TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## 0 0 0 0 0 0
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE LATITUDENAME LONGITUD
## 0 0 0 0 0 0
## LONGITUDNAME SP_JUR SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## 0 0 0 0 0 0
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT
## 0 0 0 0 0 0
## TYP_INTNAME WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## 0 0 0 0 0 0
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL
## 0 0 0 0 0 0
## RAILNAME NOT_HOUR NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## 0 0 0 0 0 0
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR HOSP_HRNAME HOSP_MN
## 0 0 0 0 0 0
## HOSP_MNNAME FATALS DRUNK_DR
## 0 0 0
Before starting the analysis, we will focus on fatalities to set a global context and humanize the topic we are addressing, examining factors such as the most dangerous day of the week and the hour of occurrence.
total_fatalities <- sum(accidentData$FATALS, na.rm = TRUE)
cat("Total number of fatalities:", total_fatalities, "\n")
## Total number of fatalities: 38824
fatalities_by_day <- accidentData %>%
group_by(DAY_WEEKNAME) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)) %>%
arrange(desc(Total_Fatalities))
cat("\nFatalities by day of the week:\n")
##
## Fatalities by day of the week:
print(fatalities_by_day)
## # A tibble: 7 × 2
## DAY_WEEKNAME Total_Fatalities
## <chr> <int>
## 1 Saturday 6712
## 2 Sunday 6114
## 3 Friday 6026
## 4 Thursday 5221
## 5 Wednesday 5055
## 6 Tuesday 4858
## 7 Monday 4838
fatalities_by_hour <- accidentData %>%
group_by(HOURNAME) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)) %>%
arrange(desc(Total_Fatalities))
cat("\nFatalities by hour of the day:\n")
##
## Fatalities by hour of the day:
print(fatalities_by_hour)
## # A tibble: 25 × 2
## HOURNAME Total_Fatalities
## <chr> <int>
## 1 9:00pm-9:59pm 2357
## 2 6:00pm-6:59pm 2356
## 3 8:00pm-8:59pm 2343
## 4 7:00pm-7:59pm 2314
## 5 5:00pm-5:59pm 2133
## 6 10:00pm-10:59pm 2055
## 7 4:00pm-4:59pm 1997
## 8 3:00pm-3:59pm 1954
## 9 11:00pm-11:59pm 1879
## 10 2:00pm-2:59pm 1734
## # ℹ 15 more rows
It is always important to consider the margin of error; in this case, 313 records out of the total have an unknown hour.
unknown_hours_count <- sum(accidentData$HOURNAME == "Unknown Hours", na.rm = TRUE)
cat("Number of records with 'Unknown Hours':", unknown_hours_count, "\n")
## Number of records with 'Unknown Hours': 313
In this accident analysis, we found a total of 38,824 fatalities, which highlights a serious issue with road safety. Looking at the days of the week, Saturday is the most dangerous with 6,712 fatalities, followed closely by Sunday with 6,114 and Friday with 6,026. This suggests that weekends, when people tend to go out more, may be linked to a higher risk, possibly due to alcohol consumption.
Regarding the hours of the day, the most critical times are at night, with 9:00 PM to 9:59 PM recording the highest number of fatalities at 2,357, followed by 6:00 PM to 6:59 PM with 2,356 fatalities. This indicates that nighttime is particularly dangerous on the roads.
It is clear that action is needed to improve road safety, especially during weekends and these critical hours.
After analyzing and transforming the dataset, we will focus our analysis on the emergency-related variables.
EMERGENCY SERVICE DIMENSION
accidentData <- accidentData %>%
  mutate(
    # Minutes from EMS notification to arrival at the scene
    Tiempo_Respuesta = (ARR_HOUR * 60 + ARR_MIN) - (NOT_HOUR * 60 + NOT_MIN),
    # Minutes from arrival at the scene to arrival at the hospital
    Tiempo_Hasta_Hospital = (HOSP_HR * 60 + HOSP_MN) - (ARR_HOUR * 60 + ARR_MIN)
  )
head(accidentData %>% select(NOT_HOUR, NOT_MIN, ARR_HOUR, ARR_MIN, Tiempo_Respuesta))
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN Tiempo_Respuesta
## 1 99 99 3 10 -5849
## 2 17 18 17 26 8
## 3 14 58 15 15 17
## 4 99 99 99 99 0
## 5 0 45 0 55 10
## 6 17 0 17 19 19
We observe that these columns encode unknown values as 99, so we will replace those codes with NA.
accidentData <- accidentData %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR == 99, NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN == 99, NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR == 99, NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN == 99, NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR == 99, NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN == 99, NA, HOSP_MN)
)
hour_columns <- accidentData[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
na_summary_hours
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 19520 16888 19724 16967 15201 14967
na_percentage_hours <- (na_summary_hours / nrow(accidentData)) * 100
na_percentage_hours
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 54.57697 47.21803 55.14735 47.43891 42.50126 41.84701
The percentages are so high that any analysis of exact response times would be inaccurate. We need a different approach, such as analyzing time ranges; however, as we will see, even the time-range fields contain many unknown values.
incident_counts <- accidentData %>%
group_by(NOT_HOURNAME) %>%
summarise(Incident_Count = n())
ggplot(incident_counts, aes(x = NOT_HOURNAME, y = Incident_Count)) +
geom_bar(stat = "identity") +
labs(title = "Number of Incidents by Notification Time Range",
x = "Notification Time Range",
y = "Number of Incidents") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Count 'Unknown' labels; note that ARR_HOURNAME and ARR_MINNAME use longer labels
# ("Unknown EMS Scene Arrival Hour"/"... Minutes"), which explains the zeros below
unknown_counts <- accidentData %>%
  summarise(
    NOT_HOURNAME_Unknown = sum(NOT_HOURNAME == "Unknown", na.rm = TRUE),
    ARR_HOURNAME_Unknown = sum(ARR_HOURNAME == "Unknown", na.rm = TRUE),
    ARR_MINNAME_Unknown = sum(ARR_MINNAME == "Unknown", na.rm = TRUE)
  )
print(unknown_counts)
## NOT_HOURNAME_Unknown ARR_HOURNAME_Unknown ARR_MINNAME_Unknown
## 1 19520 0 0
accidentData <- accidentData %>%
select(-Tiempo_Respuesta, -Tiempo_Hasta_Hospital)
emergency_variables <- accidentData %>%
select(NOT_HOUR, NOT_MIN, ARR_HOUR, ARR_MIN, HOSP_HR, HOSP_MN)
summary_stats <- summary(emergency_variables)
na_counts <- sapply(emergency_variables, function(x) sum(is.na(x)))
summary_with_na <- list(
summary_stats = summary_stats,
na_counts = na_counts
)
print(summary_with_na)
## $summary_stats
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.:18.00 1st Qu.: 8.00 1st Qu.:18.00
## Median :15.00 Median :36.00 Median :15.00 Median :36.00
## Mean :16.21 Mean :41.13 Mean :16.25 Mean :41.42
## 3rd Qu.:19.00 3rd Qu.:54.00 3rd Qu.:19.00 3rd Qu.:54.00
## Max. :88.00 Max. :98.00 Max. :88.00 Max. :98.00
## NA's :19520 NA's :16888 NA's :19724 NA's :16967
## HOSP_HR HOSP_MN
## Min. : 0.00 Min. : 0.00
## 1st Qu.:18.00 1st Qu.:42.00
## Median :88.00 Median :88.00
## Mean :61.76 Mean :67.64
## 3rd Qu.:88.00 3rd Qu.:88.00
## Max. :88.00 Max. :98.00
## NA's :15201 NA's :14967
##
## $na_counts
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 19520 16888 19724 16967 15201 14967
Emergency Conclusion: There is a significant number of missing values across all variables, especially in NOT_HOUR, ARR_HOUR, and ARR_MIN, where more than 50% of the data is missing. This indicates that alternative variables should be used or that we should check if these data have improved in the 2022 dataset.
path2 <- 'accident2022.CSV'
accidentData2022 <- read.csv(path2, row.names = NULL)
str(accidentData2022)
## 'data.frame': 13047 obs. of 80 variables:
## $ STATE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STATENAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ ST_CASE : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ PEDS : int 0 0 0 0 1 1 0 0 0 0 ...
## $ PERNOTMVIT : int 0 0 0 0 1 1 0 0 0 0 ...
## $ VE_TOTAL : int 2 2 1 1 1 1 2 1 2 1 ...
## $ VE_FORMS : int 2 2 1 1 1 1 2 1 2 1 ...
## $ PVH_INVL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PERSONS : int 3 5 2 1 1 5 1 1 3 1 ...
## $ PERMVIT : int 3 5 2 1 1 5 1 1 3 1 ...
## $ COUNTY : int 107 101 115 101 73 101 63 101 71 131 ...
## $ COUNTYNAME : chr "PICKENS (107)" "MONTGOMERY (101)" "ST. CLAIR (115)" "MONTGOMERY (101)" ...
## $ CITY : int 0 0 0 0 0 2130 0 0 0 0 ...
## $ CITYNAME : chr "NOT APPLICABLE" "NOT APPLICABLE" "NOT APPLICABLE" "NOT APPLICABLE" ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MONTHNAME : chr "January" "January" "January" "January" ...
## $ DAY : int 1 1 1 2 2 2 4 4 4 5 ...
## $ DAYNAME : int 1 1 1 2 2 2 4 4 4 5 ...
## $ DAY_WEEK : int 7 7 7 1 1 1 3 3 3 4 ...
## $ DAY_WEEKNAME: chr "Saturday" "Saturday" "Saturday" "Sunday" ...
## $ YEAR : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ HOUR : int 12 16 1 14 18 18 9 14 11 0 ...
## $ HOURNAME : chr "12:00pm-12:59pm" "4:00pm-4:59pm" "1:00am-1:59am" "2:00pm-2:59pm" ...
## $ MINUTE : int 30 40 33 46 48 28 5 50 40 0 ...
## $ MINUTENAME : chr "30" "40" "33" "46" ...
## $ TWAY_ID : chr "US-82 SR-6" "US-231 SR-53" "CR-KELLY CREEK RD" "I-65" ...
## $ TWAY_ID2 : chr "" "" "" "" ...
## $ ROUTE : int 2 2 4 1 1 6 4 4 2 3 ...
## $ ROUTENAME : chr "US Highway" "US Highway" "County Road" "Interstate" ...
## $ RUR_URB : int 1 1 1 1 2 2 1 1 1 1 ...
## $ RUR_URBNAME : chr "Rural" "Rural" "Rural" "Rural" ...
## $ FUNC_SYS : int 3 3 5 1 1 4 5 5 3 3 ...
## $ FUNC_SYSNAME: chr "Principal Arterial - Other" "Principal Arterial - Other" "Major Collector" "Interstate" ...
## $ RD_OWNER : int 1 1 2 1 1 4 2 2 1 1 ...
## $ RD_OWNERNAME: chr "State Highway Agency" "State Highway Agency" "County Highway Agency" "State Highway Agency" ...
## $ NHS : int 1 1 0 1 1 0 0 0 1 1 ...
## $ NHSNAME : chr "This section IS ON the NHS" "This section IS ON the NHS" "This section IS NOT on the NHS" "This section IS ON the NHS" ...
## $ SP_JUR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SP_JURNAME : chr "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" ...
## $ MILEPT : int 4 974 0 1595 1342 0 0 0 1243 52 ...
## $ MILEPTNAME : chr "4" "974" "None" "1595" ...
## $ LATITUDE : num 33.5 32.1 33.4 32.2 33.5 ...
## $ LATITUDENAME: num 33.5 32.1 33.4 32.2 33.5 ...
## $ LONGITUD : num -88.3 -86.1 -86.4 -86.4 -86.7 ...
## $ LONGITUDNAME: num -88.3 -86.1 -86.4 -86.4 -86.7 ...
## $ HARM_EV : int 12 12 42 34 8 8 12 38 12 42 ...
## $ HARM_EVNAME : chr "Motor Vehicle In-Transport" "Motor Vehicle In-Transport" "Tree (Standing Only)" "Ditch" ...
## $ MAN_COLL : int 7 2 0 0 0 0 1 0 6 0 ...
## $ MAN_COLLNAME: chr "Sideswipe - Same Direction" "Front-to-Front" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" ...
## $ RELJCT1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RELJCT1NAME : chr "No" "No" "No" "No" ...
## $ RELJCT2 : int 1 1 1 1 1 1 1 1 2 1 ...
## $ RELJCT2NAME : chr "Non-Junction" "Non-Junction" "Non-Junction" "Non-Junction" ...
## $ TYP_INT : int 1 1 1 1 1 1 1 1 2 1 ...
## $ TYP_INTNAME : chr "Not an Intersection" "Not an Intersection" "Not an Intersection" "Not an Intersection" ...
## $ REL_ROAD : int 1 1 4 4 2 1 1 4 1 4 ...
## $ REL_ROADNAME: chr "On Roadway" "On Roadway" "On Roadside" "On Roadside" ...
## $ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WRK_ZONENAME: chr "None" "None" "None" "None" ...
## $ LGT_COND : int 1 1 2 1 2 3 1 1 1 2 ...
## $ LGT_CONDNAME: chr "Daylight" "Daylight" "Dark - Not Lighted" "Daylight" ...
## $ WEATHER : int 1 1 10 10 2 1 1 1 1 1 ...
## $ WEATHERNAME : chr "Clear" "Clear" "Cloudy" "Cloudy" ...
## $ SCH_BUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SCH_BUSNAME : chr "No" "No" "No" "No" ...
## $ RAIL : chr "0000000" "0000000" "0000000" "0000000" ...
## $ RAILNAME : chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ NOT_HOUR : int 12 99 1 14 18 18 99 99 11 0 ...
## $ NOT_HOURNAME: chr "12:00pm-12:59pm" "Unknown" "1:00am-1:59am" "2:00pm-2:59pm" ...
## $ NOT_MIN : int 47 99 33 48 48 26 99 99 36 0 ...
## $ NOT_MINNAME : chr "47" "Unknown" "33" "48" ...
## $ ARR_HOUR : int 13 99 1 15 18 18 99 99 11 0 ...
## $ ARR_HOURNAME: chr "1:00pm-1:59pm" "Unknown EMS Scene Arrival Hour" "1:00am-1:59am" "3:00pm-3:59pm" ...
## $ ARR_MIN : int 4 99 50 9 54 32 99 99 54 33 ...
## $ ARR_MINNAME : chr "4" "Unknown EMS Scene Arrival Minutes" "50" "9" ...
## $ HOSP_HR : int 13 99 99 15 88 99 88 99 12 88 ...
## $ HOSP_HRNAME : chr "1:00pm-1:59pm" "Unknown" "Unknown" "3:00pm-3:59pm" ...
## $ HOSP_MN : int 47 99 99 44 88 99 88 99 41 88 ...
## $ HOSP_MNNAME : chr "47" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "44" ...
## $ FATALS : int 1 2 1 1 1 1 1 1 1 1 ...
accidentData2022 <- accidentData2022 %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR == 99, NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN == 99, NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR == 99, NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN == 99, NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR == 99, NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN == 99, NA, HOSP_MN)
)
hour_columns <- accidentData2022[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
print(na_summary_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 7165 7723 7481 7973 5787 5752
na_percentage_hours <- (na_summary_hours / nrow(accidentData2022)) * 100
print(na_percentage_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 54.91684 59.19368 57.33885 61.10983 44.35502 44.08676
During our data exploration and upon revisiting the documentation, we found that 88 corresponds to “Not Applicable or Not Notified”. We will include this value in the percentage calculations.
accidentData2022 <- accidentData2022 %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR %in% c(99, 88), NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN %in% c(99, 88), NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR %in% c(99, 88), NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN %in% c(99, 88), NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR %in% c(99, 88), NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN %in% c(99, 88), NA, HOSP_MN)
)
hour_columns <- accidentData2022[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
print(na_summary_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 7224 7782 7540 8032 11011 10976
na_percentage_hours <- (na_summary_hours / nrow(accidentData2022)) * 100
print(na_percentage_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 55.36905 59.64590 57.79106 61.56204 84.39488 84.12662
The percentage values remain very high, even when using a dataset from a more recent year. This indicates that the focus should be on how these values are recorded and on handling the missing information, as there are many unknown values.
An attempt is made to perform multiple imputation based on an external reference. We begin the process by applying the technique described there, focusing on the relevant columns: first, we replace the values 99 and 88, which are considered unknown data, with NA to facilitate imputation.
We use the mice function to perform the imputation, setting the number of imputations to 30. This ensures that the missing data in the notification and emergency arrival columns are adequately addressed.
Finally, visualizations are generated to analyze the relationships between notification and arrival times, allowing for a deeper examination of the efficiency of the emergency response system.
columns <- c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")
accidentData <- accidentData %>%
mutate(across(all_of(columns), ~ replace(., . %in% c(99, 88), NA)))
set.seed(2018)
imputed_data3 <- mice(accidentData[, names(accidentData) %in% columns], m = 30, print = FALSE)
complete.data3 <- mice::complete(imputed_data3)
xyplot(imputed_data3, ARR_HOUR ~ NOT_HOUR)
We create a new dataset with the imputed values (the completed data were already extracted above).
# Overwrite the original columns with their imputed versions
for (column in columns) {
  accidentData[[column]] <- complete.data3[[column]]
}
accidentData_imputed <- accidentData
head(accidentData_imputed)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL PEDS PERSONS PERMVIT
## 1 1 Alabama 10001 1 1 0 0 4 4
## 2 1 Alabama 10002 4 4 0 0 6 6
## 3 1 Alabama 10003 2 2 0 0 2 2
## 4 1 Alabama 10004 1 1 0 0 5 5
## 5 1 Alabama 10005 1 1 0 0 1 1
## 6 1 Alabama 10006 2 2 0 0 3 3
## PERNOTMVIT COUNTY COUNTYNAME CITY CITYNAME DAY DAYNAME MONTH
## 1 0 51 ELMORE (51) 0 NOT APPLICABLE 1 1 1
## 2 0 73 JEFFERSON (73) 350 BIRMINGHAM 2 2 1
## 3 0 117 SHELBY (117) 0 NOT APPLICABLE 2 2 1
## 4 0 15 CALHOUN (15) 0 NOT APPLICABLE 3 3 1
## 5 0 37 COOSA (37) 0 NOT APPLICABLE 4 4 1
## 6 0 103 MORGAN (103) 0 NOT APPLICABLE 4 4 1
## MONTHNAME YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE MINUTENAME NHS
## 1 January 2020 4 Wednesday 2 2:00am-2:59am 58 58 0
## 2 January 2020 5 Thursday 17 5:00pm-5:59pm 18 18 0
## 3 January 2020 5 Thursday 14 2:00pm-2:59pm 55 55 0
## 4 January 2020 6 Friday 15 3:00pm-3:59pm 20 20 0
## 5 January 2020 7 Saturday 0 0:00am-0:59am 45 45 0
## 6 January 2020 7 Saturday 16 4:00pm-4:59pm 55 55 0
## NHSNAME ROUTE ROUTENAME
## 1 This section IS NOT on the NHS 4 County Road
## 2 This section IS NOT on the NHS 6 Local Street - Municipality
## 3 This section IS NOT on the NHS 3 State Highway
## 4 This section IS NOT on the NHS 4 County Road
## 5 This section IS NOT on the NHS 4 County Road
## 6 This section IS NOT on the NHS 3 State Highway
## TWAY_ID TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS
## 1 cr-4 notspecify 1 Rural 5
## 2 martin luther king jr dr notspecify 2 Urban 4
## 3 sr-76 us-280 1 Rural 4
## 4 CR-ALEXANDRIA WELLINGTON RD notspecify 1 Rural 7
## 5 CR-63 notspecify 1 Rural 5
## 6 sr-36 notspecify 1 Rural 4
## FUNC_SYSNAME RD_OWNER RD_OWNERNAME MILEPT MILEPTNAME
## 1 Major Collector 2 County Highway Agency 0 None
## 2 Minor Arterial 4 City or Municipal Highway Agency 0 None
## 3 Minor Arterial 1 State Highway Agency 49 49
## 4 Local 2 County Highway Agency 0 None
## 5 Major Collector 2 County Highway Agency 0 None
## 6 Minor Arterial 1 State Highway Agency 390 390
## LATITUDE LATITUDENAME LONGITUD LONGITUDNAME SP_JUR SP_JURNAME
## 1 32.43313 32.43313333 -86.09485 -86.09485 0 No Special Jurisdiction
## 2 33.48466 33.48465833 -86.83954 -86.83954444 0 No Special Jurisdiction
## 3 33.29994 33.29994167 -86.36964 -86.36964167 0 No Special Jurisdiction
## 4 33.79507 33.79507222 -85.88349 -85.88348611 0 No Special Jurisdiction
## 5 32.84841 32.84841389 -86.08355 -86.08354722 0 No Special Jurisdiction
## 6 34.50894 34.50894167 -86.67486 -86.67485556 0 No Special Jurisdiction
## HARM_EV HARM_EVNAME MAN_COLL
## 1 42 Tree (Standing Only) 0
## 2 12 Motor Vehicle In-Transport 6
## 3 34 Ditch 0
## 4 42 Tree (Standing Only) 0
## 5 42 Tree (Standing Only) 0
## 6 12 Motor Vehicle In-Transport 2
## MAN_COLLNAME
## 1 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 2 Angle
## 3 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 4 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 5 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 6 Front-to-Front
## RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT TYP_INTNAME
## 1 0 No 1 Non-Junction 1 Not an Intersection
## 2 0 No 1 Non-Junction 1 Not an Intersection
## 3 0 No 3 Intersection-Related 3 T-Intersection
## 4 0 No 1 Non-Junction 1 Not an Intersection
## 5 0 No 1 Non-Junction 1 Not an Intersection
## 6 0 No 1 Non-Junction 1 Not an Intersection
## WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND LGT_CONDNAME
## 1 0 None 4 On Roadside 2 Dark - Not Lighted
## 2 0 None 1 On Roadway 3 Dark - Lighted
## 3 0 None 4 On Roadside 1 Daylight
## 4 0 None 4 On Roadside 1 Daylight
## 5 0 None 4 On Roadside 2 Dark - Not Lighted
## 6 0 None 1 On Roadway 2 Dark - Not Lighted
## WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL RAILNAME NOT_HOUR
## 1 1 Clear 0 No 0000000 Not Applicable 3
## 2 2 Rain 0 No 0000000 Not Applicable 17
## 3 2 Rain 0 No 0000000 Not Applicable 14
## 4 10 Cloudy 0 No 0000000 Not Applicable 18
## 5 2 Rain 0 No 0000000 Not Applicable 0
## 6 1 Clear 0 No 0000000 Not Applicable 17
## NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR ARR_HOURNAME
## 1 Unknown 12 Unknown 3 3:00am-3:59am
## 2 5:00pm-5:59pm 18 18 17 5:00pm-5:59pm
## 3 2:00pm-2:59pm 58 58 15 3:00pm-3:59pm
## 4 Unknown 43 Unknown 18 Unknown EMS Scene Arrival Hour
## 5 0:00am-0:59am 45 45 0 0:00am-0:59am
## 6 5:00pm-5:59pm 0 0 17 5:00pm-5:59pm
## ARR_MIN ARR_MINNAME HOSP_HR
## 1 10 10 3
## 2 26 26 17
## 3 15 15 15
## 4 47 Unknown EMS Scene Arrival Minutes 19
## 5 55 55 2
## 6 19 19 18
## HOSP_HRNAME HOSP_MN HOSP_MNNAME
## 1 Unknown 47 Unknown EMS Hospital Arrival Time
## 2 Unknown 12 Unknown EMS Hospital Arrival Time
## 3 Unknown 21 Unknown EMS Hospital Arrival Time
## 4 Unknown 22 Unknown EMS Hospital Arrival Time
## 5 Not Applicable (Not Transported) 8 Not Applicable (Not Transported)
## 6 6:00pm-6:59pm 51 51
## FATALS DRUNK_DR
## 1 3 1
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
The histogram below shows the distribution of the emergency notification hour (NOT_HOUR).
Next, we tabulate the frequencies to confirm (although it is already visible in the graph) that 8:00 PM is the hour with the highest number of notifications.
ggplot(accidentData_imputed, aes(x = NOT_HOUR)) +
geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Notification Hour",
x = "Notification Hour",
y = "Frequency")
frecuencia_not_hour <- table(accidentData_imputed$NOT_HOUR)
frecuencia_not_hour_df <- as.data.frame(frecuencia_not_hour)
names(frecuencia_not_hour_df) <- c("Hora", "Frecuencia")
frecuencia_maxima <- frecuencia_not_hour_df[which.max(frecuencia_not_hour_df$Frecuencia), ]
frecuencia_maxima
## Hora Frecuencia
## 21 20 2333
Now, we create another graph to visualize fatalities by notification hour.
At first glance, we can confirm that the hour with the most notifications (8:00 PM) is also the one with the highest number of fatalities.
ggplot(accidentData_imputed %>%
group_by(NOT_HOUR) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)),
aes(x = NOT_HOUR, y = Total_Fatalities)) +
geom_bar(stat = "identity", fill = "red", alpha = 0.7) +
labs(title = "Total Fatalities by Notification Hour",
x = "Notification Hour",
y = "Total Fatalities") +
theme_minimal()
Now, we will create the same graph but based on time ranges.
Although we could anticipate that the number of fatalities is higher in the afternoon, this plot provides a more complete visualization, suggesting that the later it gets, the higher the number of fatalities.
accidentData_imputed <- accidentData_imputed %>%
  mutate(Time_Range = case_when(
    NOT_HOUR >= 0 & NOT_HOUR < 6 ~ "00-06",
    NOT_HOUR >= 6 & NOT_HOUR < 12 ~ "06-12",
    NOT_HOUR >= 12 & NOT_HOUR < 18 ~ "12-18",
    TRUE ~ "18-24"  # remaining hours (18:00-23:59)
  ))
fatalities_by_time_range <- accidentData_imputed %>%
group_by(Time_Range) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE))
ggplot(fatalities_by_time_range, aes(x = Time_Range, y = Total_Fatalities)) +
geom_bar(stat = "identity", fill = "purple", alpha = 0.7) +
labs(title = "Total Fatalities by Time Range",
x = "Time Range",
y = "Total Fatalities") +
theme_minimal()
We perform a correlation analysis as confirmation and find a strong positive relationship between the notification hour and the number of fatalities (r ≈ 0.83).
muertes_por_hora <- accidentData_imputed %>%
group_by(NOT_HOUR) %>%
summarise(Total_Muertes = sum(FATALS, na.rm = TRUE))
correlacion <- cor(muertes_por_hora$NOT_HOUR, muertes_por_hora$Total_Muertes, use = "complete.obs")
print(correlacion)
## [1] 0.8305662
We create a Poisson regression model to study how the notification hour (NOT_HOUR) relates to the total number of fatalities (Total_Muertes).
The results indicate that fatalities tend to increase significantly as the day progresses, with a coefficient of 0.0347 per hour on the log scale.
modelo <- glm(Total_Muertes ~ NOT_HOUR, data = muertes_por_hora, family = "poisson")
summary(modelo)
##
## Call:
## glm(formula = Total_Muertes ~ NOT_HOUR, family = "poisson", data = muertes_por_hora)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.9608730 0.0110399 630.52 <2e-16 ***
## NOT_HOUR 0.0347102 0.0007459 46.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3206.6 on 23 degrees of freedom
## Residual deviance: 1003.5 on 22 degrees of freedom
## AIC: 1227.9
##
## Number of Fisher Scoring iterations: 4
ggplot(muertes_por_hora, aes(x = NOT_HOUR, y = Total_Muertes)) +
geom_point(color = "blue") +
geom_smooth(method = "glm", method.args = list(family = "poisson"), color = "red") +
labs(title = "Relationship Between Notification Hour and Total Fatalities",
x = "Time of NotificaciĂłn",
y = "Total of deaths") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The analysis of the relationship between the emergency notification hour (NOT_HOUR) and the number of fatalities provides insight into how these situations unfold. Using the regression model, we found a significant connection: for each additional hour, the expected number of fatalities increases by approximately 3.53%. This result is statistically strong (p-value less than 2e-16), indicating that it is very unlikely to be due to chance.
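The 3.53% figure follows directly from exponentiating the Poisson coefficient:
# Rate ratio per additional hour implied by the Poisson coefficient
exp(0.0347102)              # ~1.0353: each hour multiplies the expected count by ~1.035
(exp(0.0347102) - 1) * 100  # ~3.53% increase in expected fatalities per hour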
Examining the data, we observe a cyclical pattern in the hours of the day. For example, the number of fatalities at 11:00 PM and 12:00 AM is similar, as both hours are part of the same daily transition. This finding highlights that, even as the day changes, emergency activity continues.
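This similarity can be checked directly on the aggregated table built earlier; the exact figures depend on the imputed data.
# Fatalities notified at 23:00 versus 00:00
muertes_por_hora %>% dplyr::filter(NOT_HOUR %in% c(23, 0))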
The majority of fatalities occur between 6:00 PM and 12:00 AM, which may be related to increased public and private activity during those hours. This suggests that emergency services should be prepared for a higher number of incidents during these critical periods.
To improve emergency response, it is essential to adjust resources and staffing based on these demand peaks. Increasing the number of ambulances and medical personnel during these key hours could make a significant difference in response times and potentially reduce the number of fatalities.
In this analysis, we examine the weather-related variables to identify potentially unrecorded values in the accident data. We observed that only 0.73% of the records are classified as "Reported as Unknown", a percentage so low that it is considered insignificant and will not affect the validity of our analysis.
total_registros <- nrow(accidentData)
clima_frecuencia <- accidentData %>%
group_by(WEATHERNAME) %>%
summarise(Frecuencia = n()) %>%
mutate(Porcentaje = (Frecuencia / total_registros) * 100) %>%
arrange(desc(Frecuencia))
print(clima_frecuencia)
## # A tibble: 13 × 3
## WEATHERNAME Frecuencia Porcentaje
## <chr> <int> <dbl>
## 1 Clear 24963 69.8
## 2 Cloudy 4622 12.9
## 3 Rain 2634 7.36
## 4 Not Reported 2461 6.88
## 5 Fog, Smog, Smoke 370 1.03
## 6 Snow 283 0.791
## 7 Reported as Unknown 261 0.730
## 8 Severe Crosswinds 56 0.157
## 9 Freezing Rain or Drizzle 39 0.109
## 10 Blowing Snow 26 0.0727
## 11 Sleet or Hail 26 0.0727
## 12 Other 20 0.0559
## 13 Blowing Sand, Soil, Dirt 5 0.0140
ggplot(clima_frecuencia, aes(x = reorder(WEATHERNAME, -Frecuencia), y = Frecuencia)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Accident Frequency by Weather Conditions",
x = "Weather Conditions",
y = "Accident Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To evaluate the frequency of the different weather conditions, we counted the observations in each category of the WEATHERNAME variable; the resulting frequencies and associated percentages are shown above.
Observations
The data show that the majority of accidents (69.8%) occurred under clear conditions, which suggests that weather conditions are not always a determining factor in accident severity. However, rainy and cloudy conditions also constitute a significant portion of the records, indicating the need for further investigation into the relationship between these conditions and accident severity.
The presence of a considerable percentage of records classified as “Not Reported” (6.88%) also highlights the importance of improving data collection regarding weather conditions in accident reports.
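As a sketch of that further investigation, we could compare severity across weather conditions through the mean number of fatalities per accident, using the same accidentData frame as above:
# Mean number of fatalities per accident under each weather condition
accidentData %>%
  group_by(WEATHERNAME) %>%
  summarise(Accidents = n(),
            Mean_Fatalities = mean(FATALS, na.rm = TRUE)) %>%
  arrange(desc(Mean_Fatalities))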
At this point, we proceed to normalize the variables. As demonstrated in the example exercise, we use the following code to normalize the variables FATALS and PERNOTMVIT, scaling all values to a common range between 0 and 1 to facilitate comparison and further analysis, and to one-hot encode the categorical variable WEATHERNAME into dummy variables so that it can enter the subsequent analysis.
path <- 'accident.CSV'
accidentData <- read.csv(path, row.names = NULL)
# Keep only the variables involved in this analysis
accidentData <- accidentData %>%
  select(FATALS, PERNOTMVIT, WEATHERNAME)
# Min-max normalization to the [0, 1] range
nor <- function(x) {(x - min(x)) / (max(x) - min(x))}
accidentData_nor <- accidentData %>%
  mutate(FATALS = nor(FATALS), PERNOTMVIT = nor(PERNOTMVIT))
# One-hot encode WEATHERNAME: one dummy column per weather category
accidentData_dummies <- accidentData_nor %>%
  bind_cols(model.matrix(~WEATHERNAME - 1, data = .)) %>%
  select(-WEATHERNAME)
head(accidentData_dummies)
## FATALS PERNOTMVIT WEATHERNAMEBlowing Sand, Soil, Dirt
## 1 0.2857143 0 0
## 2 0.0000000 0 0
## 3 0.0000000 0 0
## 4 0.0000000 0 0
## 5 0.0000000 0 0
## 6 0.0000000 0 0
## WEATHERNAMEBlowing Snow WEATHERNAMEClear WEATHERNAMECloudy
## 1 0 1 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 0 0 0
## 6 0 1 0
## WEATHERNAMEFog, Smog, Smoke WEATHERNAMEFreezing Rain or Drizzle
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## WEATHERNAMENot Reported WEATHERNAMEOther WEATHERNAMERain
## 1 0 0 0
## 2 0 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## WEATHERNAMEReported as Unknown WEATHERNAMESevere Crosswinds
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## WEATHERNAMESleet or Hail WEATHERNAMESnow
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
We perform a Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and explore its underlying structure.
The code executes the prcomp function, centering and scaling the data, which allows the variables to be analyzed on the same scale. Then, we summarize the results to obtain information on the variance explained by each principal component.
The variance proportion indicates how much information each component retains, facilitating the identification of the most relevant components.
Finally, we generate a scree plot of the explained variance to visualize how it is distributed across the principal components.
# PCA on the normalized, one-hot-encoded data, centering and scaling all variables
pca.acc <- prcomp(accidentData_dummies, center = TRUE, scale. = TRUE)
summary(pca.acc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3313 1.05582 1.0400 1.03116 1.00778 1.00487 1.00262
## Proportion of Variance 0.1182 0.07432 0.0721 0.07089 0.06771 0.06732 0.06702
## Cumulative Proportion 0.1182 0.19247 0.2646 0.33546 0.40317 0.47049 0.53751
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00125 1.00055 1.00040 1.00036 1.00011 0.99842 0.96708
## Proportion of Variance 0.06683 0.06674 0.06672 0.06671 0.06668 0.06646 0.06235
## Cumulative Proportion 0.60434 0.67108 0.73780 0.80451 0.87119 0.93765 1.00000
## PC15
## Standard deviation 1.802e-13
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
library(factoextra)
# Eigenvalues and scree plot of the principal components
ev <- get_eig(pca.acc)
fviz_eig(pca.acc)
# Component variances (eigenvalues)
var_acc <- pca.acc$sdev^2
head(var_acc)
## [1] 1.772332 1.114761 1.081545 1.063291 1.015628 1.009763
# Kaiser criterion: retain components with eigenvalue greater than 1
num_components <- sum(var_acc > 1)
# Loadings (coordinates) of the variables on the retained components
var <- get_pca_var(pca.acc)
head(var$coord[, 1:num_components], 11)
## Dim.1 Dim.2 Dim.3
## FATALS 0.02188424 -0.035643784 -0.0780970227
## PERNOTMVIT -0.06739766 -0.045805788 0.3018008689
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.01547945 0.008151962 0.0004163935
## WEATHERNAMEBlowing Snow 0.03573920 0.020545567 -0.0162880978
## WEATHERNAMEClear -0.99765490 -0.045559954 -0.0089981179
## WEATHERNAMECloudy 0.62580775 -0.767977195 0.0210824022
## WEATHERNAMEFog, Smog, Smoke 0.13626324 0.065226551 -0.0162136833
## WEATHERNAMEFreezing Rain or Drizzle 0.04400598 0.021799281 -0.0264352132
## WEATHERNAMENot Reported 0.39575512 0.480776081 -0.6953529818
## WEATHERNAMEOther 0.03096916 0.013251108 0.0030929785
## WEATHERNAMERain 0.41168889 0.524753223 0.7011149348
## Dim.4 Dim.5 Dim.6
## FATALS -0.68881258 0.09145551 0.097678349
## PERNOTMVIT 0.61401332 -0.06653455 0.054083991
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.02787652 -0.04782031 -0.044492080
## WEATHERNAMEBlowing Snow 0.01886354 -0.09276140 -0.135728459
## WEATHERNAMEClear -0.02709641 0.03094815 0.004623439
## WEATHERNAMECloudy 0.03948947 0.11889687 0.019350438
## WEATHERNAMEFog, Smog, Smoke -0.08326557 -0.72437432 0.646646403
## WEATHERNAMEFreezing Rain or Drizzle -0.07790375 -0.05813001 -0.056411678
## WEATHERNAMENot Reported 0.25588567 0.19516531 0.069410173
## WEATHERNAMEOther 0.00370126 -0.06696078 -0.005676920
## WEATHERNAMERain -0.13331996 0.18854710 0.020630338
## Dim.7 Dim.8 Dim.9
## FATALS -0.144531314 -0.0798447092 -0.021804416
## PERNOTMVIT 0.048151167 -0.1020214391 0.003238107
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.081663587 0.1431814086 0.084454508
## WEATHERNAMEBlowing Snow 0.161342725 0.7024325901 0.457615332
## WEATHERNAMEClear -0.003044417 0.0006871405 -0.001197626
## WEATHERNAMECloudy 0.004533127 -0.0046146431 -0.002785924
## WEATHERNAMEFog, Smog, Smoke 0.149256576 -0.0070618034 -0.002614266
## WEATHERNAMEFreezing Rain or Drizzle -0.118747827 0.4573513883 -0.774488240
## WEATHERNAMENot Reported -0.013993173 -0.0508926548 -0.007938861
## WEATHERNAMEOther -0.010150974 -0.0632117856 0.035332462
## WEATHERNAMERain 0.002423523 0.0054484870 -0.003969482
## Dim.10 Dim.11 Dim.12
## FATALS 0.012718192 0.0009081905 0.019795339
## PERNOTMVIT -0.027968519 -0.0219824936 -0.012842878
## WEATHERNAMEBlowing Sand, Soil, Dirt -0.101496072 -0.0781297273 -0.947091457
## WEATHERNAMEBlowing Snow -0.326791617 0.2086704105 0.230015976
## WEATHERNAMEClear 0.003474573 0.0037847794 0.001921345
## WEATHERNAMECloudy 0.009312143 0.0121413165 0.004556120
## WEATHERNAMEFog, Smog, Smoke 0.028255107 0.0448464866 0.008430521
## WEATHERNAMEFreezing Rain or Drizzle -0.249976703 0.0321594655 -0.048551730
## WEATHERNAMENot Reported 0.008981734 0.0119421568 0.007537403
## WEATHERNAMEOther -0.470447865 -0.8573276481 0.142215880
## WEATHERNAMERain 0.015634162 0.0189832748 0.007401653
# Proportion of total variance explained by each component
var_acc <- pca.acc$sdev^2
proporciones_varianza <- var_acc / sum(var_acc)
scree_data <- data.frame(
  Component = 1:length(var_acc),
  Variance = proporciones_varianza
)
# Scree plot of the proportion of explained variance per component
ggplot(scree_data, aes(x = Component, y = Variance)) +
  geom_line() +
  geom_point() +
  labs(title = "Scree Plot",
       x = "Component Number",
       y = "Proportion of Explained Variance") +
  theme_minimal()
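Beyond the Kaiser criterion applied above, a common complementary rule (sketched here as an addition, with 80% as an assumed threshold) is to keep enough components to reach a cumulative-variance target:
# Sketch: smallest number of components reaching an assumed 80% threshold
cum_var <- cumsum(proporciones_varianza)
which(cum_var >= 0.80)[1]
# Per the summary above, the cumulative proportion first exceeds 0.80 at PC11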
PCA Analysis
The Principal Component Analysis (PCA) applied to the accident data shows that the first two principal components explain only about 19% of the total variance, so the information is spread across many components rather than concentrated in a low-dimensional subspace. The first six components (PC1 to PC6) together account for just over 47% of the total variance. Note also that the last component (PC15) has essentially zero variance: the one-hot dummy columns are linearly dependent, since they sum to one for every record.
Notably, the first principal component (PC1) has a high negative loading on WEATHERNAMEClear, suggesting that under clear weather conditions, more fatalities occur in accidents. The second component (PC2) shows a positive relationship with PERNOTMVIT and WEATHERNAMERain, indicating that rain may increase accident severity. Meanwhile, the third component (PC3) presents a complex combination of WEATHERNAMEFog, Smog, Smoke, and WEATHERNAMERain, suggesting that adverse weather conditions may be associated with an increased number of fatal accidents.
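These loadings are easier to inspect on a variable map; one possible visualization with factoextra (not part of the original output) is:
# Sketch: correlation circle of the variables on PC1-PC2,
# coloring each variable by its contribution to these components
fviz_pca_var(pca.acc, col.var = "contrib", repel = TRUE)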
The preliminary analysis, which shows that 69.8% of the recorded accidents occurred under clear weather conditions, has important implications for road safety and driver behavior. It highlights that, although clear conditions are considered safe for driving, accidents still occur frequently in them. Many road users, including non-motorized ones, may assume that good weather guarantees greater safety, potentially leading to riskier behaviors such as speeding or inattention to road hazards.
On the other hand, the fact that adverse weather conditions such as fog, smoke, or snow account for less than 2% of recorded accidents suggests that, although rare, these situations may significantly impact accident severity. It is possible that drivers are more cautious in these conditions, resulting in fewer accidents overall. However, when they do occur, these accidents tend to be more severe due to factors such as reduced visibility and loss of vehicle control.
The PCA performed on the normalized dataset yielded interesting results regarding the relationship between weather conditions and accidents. As noted above, the first six principal components (PC1 to PC6) explain just over 47% of the total variance, meaning that a relatively small number of components capture nearly half of the information in the dataset.
The first component (PC1) is notable for its high negative loading on clear weather conditions (WEATHERNAMEClear). This suggests that, contrary to expectations, good weather conditions are associated with a higher number of fatalities in accidents. This finding is surprising, as good weather is typically linked to safer driving conditions.
The second component (PC2) shows a positive relationship with the number of persons not in motor vehicles (PERNOTMVIT) and with rain conditions (WEATHERNAMERain). This indicates that rain can contribute to more severe accidents, emphasizing the need for drivers to exercise greater caution in such conditions.
Finally, the third component (PC3) presents a combination of complex weather conditions, such as fog, smog, and smoke, along with rain. This suggests that when weather conditions worsen, accidents become more hazardous. These results underline the importance of road safety awareness, especially on days with adverse meteorological conditions where visibility and vehicle control may be compromised.
To conclude, the fact that most fatal accidents occur on clear days may be linked to increased traffic volume. When the weather is good, more people tend to drive, increasing the likelihood of collisions. Additionally, favorable weather may lead some drivers to become overconfident, resulting in risky behaviors such as speeding, under the false assumption that they are safer.
Conversely, accidents that occur in the rain tend to be more severe. This may be because rain makes roads slippery and reduces visibility; braking distances grow and vehicles can skid more easily, leading to hazardous situations. The combination of rain and high speed is particularly risky, especially when drivers fail to adjust their behavior to the conditions.
Regarding situations such as fog or snow, although they occur less frequently, when they do happen, the impact can be more severe. These conditions drastically reduce visibility and may lead to multi-vehicle accidents, particularly on congested or high-speed roads.
In general, it is crucial for all drivers to take weather conditions into account and adjust their driving style accordingly, even when the weather appears to be perfect.