Based on a dataset that you find interesting, propose a complete data mining project. The structure of the response must align with the typical phases of the data mining project lifecycle. There is no need to carry out the tasks of each phase.
It is expected that the response will address the following questions in a structured manner (using the CRISP-DM methodology):
For each phase, indicate the phase’s objective and the expected outcome. Use examples to illustrate what and how the tasks could be performed. If there are any unique characteristics that differentiate the lifecycle of a data mining project from other types of projects, highlight them.
This project focuses on developing an effective system to enhance safety in industrial and urban environments, where fires can occur at any time. Early detection of fire and smoke is crucial for protecting lives and property.
BUSINESS UNDERSTANDING
In this phase, the goal is to identify the business needs and expectations regarding the fire and smoke detection system. It is essential to understand the current risks, the areas most prone to fires, and how this system can enhance safety and emergency response. By the end of this stage, we should be able to answer the key questions and determine the metrics that will define success or failure.
We will conduct interviews with key stakeholders, such as corporate safety officers and firefighters, analyze statistics on fire and smoke incidents in different environments (industrial and urban), and establish success criteria, such as a target percentage of correct detections and response time.
This system can benefit various industries seeking to protect customers, employees, or civilians, such as occupational safety companies, chemical industries handling flammable materials, and emergency response services for firefighters, among others.
Question: What does the business need?
Answer: The business needs an efficient fire and smoke detection system that enhances safety and emergency response across various industries.
DATA UNDERSTANDING
In this phase, the objective is to explore the available dataset to understand its quality, structure, and relevance to the project. The goal is to answer questions such as: What data do we have? Are they sufficient and representative? Are there quality issues? The expected outcome is a descriptive report on the dataset, including statistical summaries and quality analysis.
This phase also involves examining the number of images per class (fire and smoke) to ensure a balanced dataset, verifying annotation quality through sampling to ensure bounding boxes are correctly defined, and identifying and documenting issues such as low-quality, blurry, misclassified, or duplicate images.
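To illustrate, the class-balance check could look like the following R sketch, assuming a hypothetical annotation file labels.csv with one row per image and a class column; the stand-in data below are random and only serve the example.
# Hypothetical annotation file: one row per image with a 'class' column (0 = fire, 1 = smoke)
# labels <- read.csv("labels.csv")
set.seed(7)
labels <- data.frame(class = sample(0:1, 35000, replace = TRUE))  # stand-in data
table(labels$class)              # absolute number of images per class
prop.table(table(labels$class))  # relative class balance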
According to the dataset documentation, there are two classes:
0: Fire – Images containing visible flames or areas where a fire is clearly present.
1: Smoke – Images with visible smoke, either in the early stages of fire development or due to environmental factors.
Dataset Composition:
The dataset includes over 35,000 labeled images for fire and smoke detection.
The images vary in lighting conditions, resolutions, and environmental contexts to ensure the model generalizes well across different real-world scenarios.
Training Data: Images used for model training, with balanced examples of fire and smoke.
Validation Data: Used to fine-tune model hyperparameters and validate performance.
Test Data: Held-out data for final model evaluation, containing unseen images of fire and smoke.
Questions:
Question: What data do we have/need? Are they clean?
Answer: We have over 35,000 labeled images of fire and smoke, but we need to verify annotation quality and ensure proper class balance.
DATA PREPARATION
Data preparation involves cleaning, transforming, and organizing the dataset to ensure it is ready for modeling. This phase is crucial to minimizing errors in modeling and maximizing the quality of the results. The goal is a structured and clean dataset that can be used to train detection models.
During this phase, key tasks include normalizing images to a standard size and format (JPEG), correcting annotation errors such as incorrect bounding box coordinates or misassigned labels, and splitting the dataset into training, validation, and test sets while maintaining class distribution.
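As an illustration of the splitting task, the following R sketch performs an approximate 70/15/15 stratified split on a hypothetical images data frame; the file names, column names, and proportions are all assumptions for the example.
set.seed(42)
# Hypothetical image inventory with one class label per file
images <- data.frame(
  file  = sprintf("img_%05d.jpg", 1:1000),
  class = sample(c("fire", "smoke"), 1000, replace = TRUE)
)
probs <- c(train = 0.70, validation = 0.15, test = 0.15)
images$split <- NA_character_
# Sample the split label independently within each class so that the
# class distribution is (approximately) preserved in every split
for (cl in unique(images$class)) {
  idx <- which(images$class == cl)
  images$split[idx] <- sample(names(probs), length(idx), replace = TRUE, prob = probs)
}
table(images$class, images$split)  # check class balance per split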
“Once data sources are identified, we must proceed with preparing them so that they can be used with the methods or tools that will build the desired model. This phase, although seemingly simple, along with data selection, consumes 70% (or more!) of the effort in newly implemented data mining projects.” Page 20 - PID_00284574
Question: How do we organize the data for modeling?
Answer: We organize the data by normalizing images, correcting annotations, and splitting the dataset into training, validation, and test sets.
MODELING
In this phase, the goal is to select and apply the best techniques to solve the problem of detecting fire and smoke. The objective is to create a model that generalizes well to new data and meets the defined success criteria. By the end of this phase, we will have trained and validated object detection models ready for evaluation.
Key tasks include implementing various object detection algorithms, such as YOLO and Faster R-CNN, while tuning their hyperparameters to optimize performance; performing cross-validation to assess the model's robustness across different data subsets; and analyzing performance metrics such as precision and recall to identify which models perform best for each class.
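A minimal sketch of how cross-validation folds could be assigned; the image count and fold number are illustrative assumptions.
set.seed(1)
n <- 35000                                # hypothetical number of labeled images
k <- 5                                    # number of cross-validation folds
fold <- sample(rep(1:k, length.out = n))  # shuffled, near-equal fold assignment
table(fold)                               # each fold serves once as the validation set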
“The task of these data mining projects is not exactly the same as in the previous point. Here, it is more common to start from a more informed situation, knowing that pre-defined groups already exist.” Page 13 - PID_00284574
Question: What modeling techniques should we apply?
Answer: We will apply object detection algorithms such as YOLO and Faster R-CNN, fine-tuning their hyperparameters to optimize performance.
EVALUATION
The goal is to determine whether the model meets business requirements and expectations. An evaluation report will be generated, including a performance analysis and recommendations.
At this stage, we seek to assess whether the model fulfills the pre-established requirements and expectations, evaluating its usefulness both from a technical and business perspective. Validation techniques will be employed to measure model performance across different datasets, providing a solid evaluation. Additionally, evaluation metrics such as precision, recall, and F1-score will be analyzed to gain deeper insight into model behavior.
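As a minimal sketch of how these metrics would be computed, with hypothetical detection counts for one class:
# Hypothetical counts: true positives, false positives, false negatives
tp <- 420; fp <- 35; fn <- 60
precision <- tp / (tp + fp)  # share of detections that are correct
recall <- tp / (tp + fn)     # share of real events that are detected
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of both
round(c(precision = precision, recall = recall, F1 = f1), 3)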
It is also essential to compare the model’s results with other available alternatives. This comparison is crucial to determine the model’s effectiveness and whether other approaches should be explored. This process is not linear; instead, it involves continuous review and refinement to ensure that the model remains relevant and effective as conditions change and new challenges arise.
“This process is not linear; rather, it is iterative and continuous: new changes in the situation may render our knowledge outdated, requiring us to extract new insights.” Page 12 - PID_00284574
Question: Which model best meets the business objectives?
Answer: We will evaluate trained models using cross-validation and performance metrics to determine which best meets business objectives.
DEPLOYMENT
The final phase of the project lifecycle aims to integrate the model into an operational system, allowing end users to access its results effectively. The fire and smoke detection system is deployed, ensuring that it functions properly in a real-world environment and meets the established requirements.
Additionally, continuous monitoring of the model is essential to ensure it continues to operate optimally. This includes making periodic adjustments based on results and changing conditions, guaranteeing that the system maintains its effectiveness and continues to achieve the project objectives. (This involves continuously reapplying Phase 5: Evaluation.)
“Once the objective is defined, and when we have linked it to the project’s main task, identifying which models are most relevant and what methods and tools are needed, we must proceed to find the raw material: the data.” Page 19 - PID_00284574
Question: How do stakeholders access the results?
Answer: Stakeholders will access the results through an integrated system that enables real-time visualization and analysis of detections.
If there is any characteristic that differentiates the lifecycle of a data mining project from other projects, indicate it.
The lifecycle of a data mining project is iterative and adaptable. Unlike other data analysis projects that may follow a more linear approach, data mining requires continuous revisions and the incorporation of new data.
This ensures that models remain relevant and effective in real-world situations, which is critical for safety applications such as fire and smoke detection.
This approach reflects the need to “define the data mining task” and understand that achieving objectives may require a combined effort, where data is first grouped, then classified, and finally, a predictive model is extracted.
This iterative process is also closely linked to “model evaluation and interpretation,” where continuous validation and model adaptation to new circumstances are essential for project success. Additionally, ensuring that the extracted knowledge is valid and applicable in practice is a key principle in data mining, reinforcing the need for effective integration into the organization’s information system.
This iterative cycle not only enables continuous model improvement but also aligns with the goal of “explaining” behaviors, allowing analysts to understand and adjust the reasoning behind the results. This is particularly crucial in critical environments like fire and smoke detection.
Using the dataset from the PEC example, perform the preliminary tasks for generating a data mining model, as explained in the modules “The Data Mining Process” and “Data Preprocessing and Feature Management”.
You may use the PEC example as a reference, but you should change the approach and analyze the data based on different dimensions. Thus, you cannot use the same combination of variables as in the example: "FATALS", "DRUNK_DR", "VE_TOTAL", "VE_FORMS", "PVH_INVL", "PEDS", "PERSONS", "PERMVIT", "PERNOTMVIT". You must analyze any other combination, which may include some of these variables along with new ones.
Optionally, and as an added value, you may incorporate data from other years for temporal comparisons (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/) or include additional factors for analysis, such as drug use in accidents (https://static.nhtsa.gov/nhtsa/downloads/FARS/2020/National/FARS2020NationalCSV.zip).
A dataset from the National Highway Traffic Safety Administration (NHTSA) for the year 2020 has been selected. This dataset records accidents with at least one fatality. The objective is to understand what factors contribute to an accident being classified as severe and what defines this severity.
Before starting the analysis, we install the necessary libraries, using an if statement to check whether each one is already installed; this avoids redundant installations and conflicts in our code.
if (!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
## Loading required package: ggplot2
if(!require('Rmisc')) install.packages('Rmisc'); library('Rmisc')
## Loading required package: Rmisc
## Loading required package: lattice
## Loading required package: plyr
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if(!require('xfun')) install.packages('xfun'); library('xfun')
## Loading required package: xfun
##
## Attaching package: 'xfun'
## The following object is masked from 'package:base':
##
## attr
if(!require('factoextra')) install.packages('factoextra', dependencies = TRUE)
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if(!require('mice')) install.packages('mice', dependencies = TRUE)
## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
We load the dataset, keeping the same naming conventions as in the guided example, and examine its structure.
path <- 'accident.CSV'
accidentData <- read.csv(path, row.names = NULL)
str(accidentData)
## 'data.frame': 35766 obs. of 81 variables:
## $ STATE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STATENAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ ST_CASE : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ VE_TOTAL : int 1 4 2 1 1 2 1 2 2 2 ...
## $ VE_FORMS : int 1 4 2 1 1 2 1 2 2 2 ...
## $ PVH_INVL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PEDS : int 0 0 0 0 0 0 1 0 0 0 ...
## $ PERSONS : int 4 6 2 5 1 3 1 2 4 3 ...
## $ PERMVIT : int 4 6 2 5 1 3 1 2 4 3 ...
## $ PERNOTMVIT : int 0 0 0 0 0 0 1 0 0 0 ...
## $ COUNTY : int 51 73 117 15 37 103 73 25 45 95 ...
## $ COUNTYNAME : chr "ELMORE (51)" "JEFFERSON (73)" "SHELBY (117)" "CALHOUN (15)" ...
## $ CITY : int 0 350 0 0 0 0 330 0 0 1500 ...
## $ CITYNAME : chr "NOT APPLICABLE" "BIRMINGHAM" "NOT APPLICABLE" "NOT APPLICABLE" ...
## $ DAY : int 1 2 2 3 4 4 7 8 9 10 ...
## $ DAYNAME : int 1 2 2 3 4 4 7 8 9 10 ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MONTHNAME : chr "January" "January" "January" "January" ...
## $ YEAR : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ DAY_WEEK : int 4 5 5 6 7 7 3 4 5 6 ...
## $ DAY_WEEKNAME: chr "Wednesday" "Thursday" "Thursday" "Friday" ...
## $ HOUR : int 2 17 14 15 0 16 19 7 20 10 ...
## $ HOURNAME : chr "2:00am-2:59am" "5:00pm-5:59pm" "2:00pm-2:59pm" "3:00pm-3:59pm" ...
## $ MINUTE : int 58 18 55 20 45 55 23 15 0 2 ...
## $ MINUTENAME : chr "58" "18" "55" "20" ...
## $ NHS : int 0 0 0 0 0 0 0 0 0 1 ...
## $ NHSNAME : chr "This section IS NOT on the NHS" "This section IS NOT on the NHS" "This section IS NOT on the NHS" "This section IS NOT on the NHS" ...
## $ ROUTE : int 4 6 3 4 4 3 4 4 4 2 ...
## $ ROUTENAME : chr "County Road" "Local Street - Municipality" "State Highway" "County Road" ...
## $ TWAY_ID : chr "cr-4" "martin luther king jr dr" "sr-76" "CR-ALEXANDRIA WELLINGTON RD" ...
## $ TWAY_ID2 : chr "" "" "us-280" "" ...
## $ RUR_URB : int 1 2 1 1 1 1 2 1 1 1 ...
## $ RUR_URBNAME : chr "Rural" "Urban" "Rural" "Rural" ...
## $ FUNC_SYS : int 5 4 4 7 5 4 4 5 5 3 ...
## $ FUNC_SYSNAME: chr "Major Collector" "Minor Arterial" "Minor Arterial" "Local" ...
## $ RD_OWNER : int 2 4 1 2 2 1 4 2 2 1 ...
## $ RD_OWNERNAME: chr "County Highway Agency" "City or Municipal Highway Agency" "State Highway Agency" "County Highway Agency" ...
## $ MILEPT : int 0 0 49 0 0 390 0 0 0 3019 ...
## $ MILEPTNAME : chr "None" "None" "49" "None" ...
## $ LATITUDE : num 32.4 33.5 33.3 33.8 32.8 ...
## $ LATITUDENAME: chr "32.43313333" "33.48465833" "33.29994167" "33.79507222" ...
## $ LONGITUD : num -86.1 -86.8 -86.4 -85.9 -86.1 ...
## $ LONGITUDNAME: chr "-86.09485" "-86.83954444" "-86.36964167" "-85.88348611" ...
## $ SP_JUR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SP_JURNAME : chr "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" ...
## $ HARM_EV : int 42 12 34 42 42 12 8 12 12 12 ...
## $ HARM_EVNAME : chr "Tree (Standing Only)" "Motor Vehicle In-Transport" "Ditch" "Tree (Standing Only)" ...
## $ MAN_COLL : int 0 6 0 0 0 2 0 1 1 2 ...
## $ MAN_COLLNAME: chr "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "Angle" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" ...
## $ RELJCT1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RELJCT1NAME : chr "No" "No" "No" "No" ...
## $ RELJCT2 : int 1 1 3 1 1 1 3 1 8 1 ...
## $ RELJCT2NAME : chr "Non-Junction" "Non-Junction" "Intersection-Related" "Non-Junction" ...
## $ TYP_INT : int 1 1 3 1 1 1 2 1 1 1 ...
## $ TYP_INTNAME : chr "Not an Intersection" "Not an Intersection" "T-Intersection" "Not an Intersection" ...
## $ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WRK_ZONENAME: chr "None" "None" "None" "None" ...
## $ REL_ROAD : int 4 1 4 4 4 1 1 1 1 1 ...
## $ REL_ROADNAME: chr "On Roadside" "On Roadway" "On Roadside" "On Roadside" ...
## $ LGT_COND : int 2 3 1 1 2 2 3 1 2 1 ...
## $ LGT_CONDNAME: chr "Dark - Not Lighted" "Dark - Lighted" "Daylight" "Daylight" ...
## $ WEATHER : int 1 2 2 10 2 1 1 1 10 10 ...
## $ WEATHERNAME : chr "Clear" "Rain" "Rain" "Cloudy" ...
## $ SCH_BUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SCH_BUSNAME : chr "No" "No" "No" "No" ...
## $ RAIL : chr "0000000" "0000000" "0000000" "0000000" ...
## $ RAILNAME : chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ NOT_HOUR : int 99 17 14 99 0 17 19 7 20 10 ...
## $ NOT_HOURNAME: chr "Unknown" "5:00pm-5:59pm" "2:00pm-2:59pm" "Unknown" ...
## $ NOT_MIN : int 99 18 58 99 45 0 23 21 0 3 ...
## $ NOT_MINNAME : chr "Unknown" "18" "58" "Unknown" ...
## $ ARR_HOUR : int 3 17 15 99 0 17 19 7 20 10 ...
## $ ARR_HOURNAME: chr "3:00am-3:59am" "5:00pm-5:59pm" "3:00pm-3:59pm" "Unknown EMS Scene Arrival Hour" ...
## $ ARR_MIN : int 10 26 15 99 55 19 29 28 10 7 ...
## $ ARR_MINNAME : chr "10" "26" "15" "Unknown EMS Scene Arrival Minutes" ...
## $ HOSP_HR : int 99 99 99 99 88 18 88 88 99 10 ...
## $ HOSP_HRNAME : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ HOSP_MN : int 99 99 99 99 88 51 88 88 99 29 ...
## $ HOSP_MNNAME : chr "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" ...
## $ FATALS : int 3 1 1 1 1 1 1 1 1 1 ...
## $ DRUNK_DR : int 1 0 0 0 0 0 0 0 0 0 ...
Although we already knew this from the guided example, we obtain the number of observations and variables.
num_observaciones <- nrow(accidentData)
num_variables <- ncol(accidentData)
cat("NĂşmero de observaciones:", num_observaciones, "\n")
## NĂşmero de observaciones: 35766
cat("NĂşmero de variables:", num_variables, "\n")
## NĂşmero de variables: 81
We continue following the initial guidelines from the guided example, as the steps remain the same. Next, we review the variables and validate them against the documentation to prevent errors before starting our analysis. The variables are arranged logically by dimension so that they can be interpreted together; the dimensions to study are listed below, and the sketch after the list makes the grouping explicit in code.
FACTS TO STUDY
GEOGRAPHIC DIMENSION
TEMPORAL DIMENSION
ACCIDENT CONDITIONS DIMENSION
METEOROLOGICAL DIMENSION
OTHER FACTORS
EMERGENCY SERVICE DIMENSION
ACCIDENT-RELATED FACTORS DIMENSION
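A minimal sketch, reflecting our own reading of the FARS documentation, of how these dimensions can be made explicit as named groups of dataset variables:
# Group the variables to study into the dimensions listed above
dimensions <- list(
  geographic = c("STATE", "STATENAME", "COUNTY", "COUNTYNAME", "CITY", "CITYNAME",
                 "LATITUDE", "LONGITUD"),
  temporal = c("YEAR", "MONTH", "DAY", "DAY_WEEK", "HOUR", "MINUTE"),
  accident_conditions = c("ROUTE", "RUR_URB", "FUNC_SYS", "REL_ROAD", "TYP_INT", "LGT_COND"),
  meteorological = c("WEATHER", "WEATHERNAME"),
  other_factors = c("SCH_BUS", "WRK_ZONE", "SP_JUR"),
  emergency_service = c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN"),
  accident_factors = c("FATALS", "DRUNK_DR", "PERSONS", "PEDS", "VE_TOTAL")
)
# Check that every grouped variable actually exists in the dataset
all(unlist(dimensions) %in% names(accidentData))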
Finally, before starting data preprocessing, we review the basic statistics of our dataset.
summary(accidentData)
## STATE STATENAME ST_CASE VE_TOTAL
## Min. : 1.00 Length:35766 Min. : 10001 Min. : 1.00
## 1st Qu.:12.00 Class :character 1st Qu.:122078 1st Qu.: 1.00
## Median :26.00 Mode :character Median :260917 Median : 1.00
## Mean :27.16 Mean :272387 Mean : 1.56
## 3rd Qu.:42.00 3rd Qu.:420477 3rd Qu.: 2.00
## Max. :56.00 Max. :560115 Max. :15.00
## VE_FORMS PVH_INVL PEDS PERSONS
## Min. : 1.000 Min. : 0.00000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.: 1.000
## Median : 1.000 Median : 0.00000 Median :0.0000 Median : 2.000
## Mean : 1.517 Mean : 0.04269 Mean :0.2285 Mean : 2.173
## 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.:0.0000 3rd Qu.: 3.000
## Max. :15.000 Max. :10.00000 Max. :8.0000 Max. :61.000
## PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## Min. : 0.000 Min. :0.0000 Min. : 1.00 Length:35766
## 1st Qu.: 1.000 1st Qu.:0.0000 1st Qu.: 31.00 Class :character
## Median : 2.000 Median :0.0000 Median : 71.00 Mode :character
## Mean : 2.163 Mean :0.2387 Mean : 93.06
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:117.00
## Max. :61.000 Max. :9.0000 Max. :999.00
## CITY CITYNAME DAY DAYNAME
## Min. : 0 Length:35766 Min. : 1.00 Min. : 1.00
## 1st Qu.: 0 Class :character 1st Qu.: 8.00 1st Qu.: 8.00
## Median : 120 Mode :character Median :16.00 Median :16.00
## Mean :1436 Mean :15.71 Mean :15.71
## 3rd Qu.:2080 3rd Qu.:23.00 3rd Qu.:23.00
## Max. :9999 Max. :31.00 Max. :31.00
## MONTH MONTHNAME YEAR DAY_WEEK
## Min. : 1.000 Length:35766 Min. :2020 Min. :1.000
## 1st Qu.: 4.000 Class :character 1st Qu.:2020 1st Qu.:2.000
## Median : 7.000 Mode :character Median :2020 Median :4.000
## Mean : 6.898 Mean :2020 Mean :4.114
## 3rd Qu.:10.000 3rd Qu.:2020 3rd Qu.:6.000
## Max. :12.000 Max. :2020 Max. :7.000
## DAY_WEEKNAME HOUR HOURNAME MINUTE
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.: 7.00 Class :character 1st Qu.:14.00
## Mode :character Median :15.00 Mode :character Median :30.00
## Mean :13.94 Mean :29.24
## 3rd Qu.:19.00 3rd Qu.:45.00
## Max. :99.00 Max. :99.00
## MINUTENAME NHS NHSNAME ROUTE
## Length:35766 Min. :0.0000 Length:35766 Min. :1.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.:2.000
## Mode :character Median :0.0000 Mode :character Median :3.000
## Mean :0.5877 Mean :3.901
## 3rd Qu.:1.0000 3rd Qu.:6.000
## Max. :9.0000 Max. :9.000
## ROUTENAME TWAY_ID TWAY_ID2 RUR_URB
## Length:35766 Length:35766 Length:35766 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Mode :character Median :2.000
## Mean :1.662
## 3rd Qu.:2.000
## Max. :9.000
## RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## Length:35766 Min. : 1.000 Length:35766 Min. : 1.00
## Class :character 1st Qu.: 3.000 Class :character 1st Qu.: 1.00
## Mode :character Median : 4.000 Mode :character Median : 1.00
## Mean : 6.038 Mean :19.96
## 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :99.000 Max. :99.00
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE
## Length:35766 Min. : 0.0 Length:35766 Min. : 19.09
## Class :character 1st Qu.: 2.0 Class :character 1st Qu.: 32.99
## Mode :character Median : 80.0 Mode :character Median : 36.17
## Mean :19990.7 Mean : 36.90
## 3rd Qu.: 955.5 3rd Qu.: 40.45
## Max. :99999.0 Max. :100.00
## LATITUDENAME LONGITUD LONGITUDNAME SP_JUR
## Length:35766 Min. :-165.30 Length:35766 Min. :0.00000
## Class :character 1st Qu.: -97.90 Class :character 1st Qu.:0.00000
## Mode :character Median : -87.81 Mode :character Median :0.00000
## Mean : -84.59 Mean :0.04029
## 3rd Qu.: -81.52 3rd Qu.:0.00000
## Max. :1000.00 Max. :9.00000
## SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## Length:35766 Min. : 1.00 Length:35766 Min. : 0.000
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 0.000
## Mode :character Median :12.00 Mode :character Median : 0.000
## Mean :18.31 Mean : 1.929
## 3rd Qu.:30.00 3rd Qu.: 2.000
## Max. :99.00 Max. :99.000
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2
## Length:35766 Min. :0.00000 Length:35766 Min. : 1.000
## Class :character 1st Qu.:0.00000 Class :character 1st Qu.: 1.000
## Mode :character Median :0.00000 Mode :character Median : 1.000
## Mean :0.07283 Mean : 2.368
## 3rd Qu.:0.00000 3rd Qu.: 2.000
## Max. :9.00000 Max. :99.000
## RELJCT2NAME TYP_INT TYP_INTNAME WRK_ZONE
## Length:35766 Min. : 1.000 Length:35766 Min. :0.00000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:0.00000
## Mode :character Median : 1.000 Mode :character Median :0.00000
## Mean : 1.764 Mean :0.04748
## 3rd Qu.: 1.000 3rd Qu.:0.00000
## Max. :99.000 Max. :4.00000
## WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## Length:35766 Min. : 1.000 Length:35766 Min. :1.000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:1.000
## Mode :character Median : 1.000 Mode :character Median :2.000
## Mean : 2.557 Mean :1.961
## 3rd Qu.: 4.000 3rd Qu.:3.000
## Max. :99.000 Max. :9.000
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS
## Length:35766 Min. : 1.000 Length:35766 Min. :0.000000
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:0.000000
## Mode :character Median : 1.000 Mode :character Median :0.000000
## Mean : 9.725 Mean :0.001426
## 3rd Qu.: 2.000 3rd Qu.:0.000000
## Max. :99.000 Max. :1.000000
## SCH_BUSNAME RAIL RAILNAME NOT_HOUR
## Length:35766 Length:35766 Length:35766 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:16.00
## Mode :character Mode :character Mode :character Median :99.00
## Mean :61.39
## 3rd Qu.:99.00
## Max. :99.00
## NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.:34.00 Class :character 1st Qu.:16.00
## Mode :character Median :98.00 Mode :character Median :99.00
## Mean :68.46 Mean :61.88
## 3rd Qu.:99.00 3rd Qu.:99.00
## Max. :99.00 Max. :99.00
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR
## Length:35766 Min. : 0.00 Length:35766 Min. : 0.00
## Class :character 1st Qu.:34.00 Class :character 1st Qu.:88.00
## Mode :character Median :98.00 Mode :character Median :88.00
## Mean :68.74 Mean :77.59
## 3rd Qu.:99.00 3rd Qu.:99.00
## Max. :99.00 Max. :99.00
## HOSP_HRNAME HOSP_MN HOSP_MNNAME FATALS
## Length:35766 Min. : 0.00 Length:35766 Min. :1.000
## Class :character 1st Qu.:88.00 Class :character 1st Qu.:1.000
## Mode :character Median :88.00 Mode :character Median :1.000
## Mean :80.76 Mean :1.085
## 3rd Qu.:99.00 3rd Qu.:1.000
## Max. :99.00 Max. :8.000
## DRUNK_DR
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2664
## 3rd Qu.:1.0000
## Max. :4.0000
In this phase, we transform our dataset into a new one by removing or correcting any information that we consider erroneous or of little value, so that we can begin our analysis.
Logically, we first check for missing values in both forms: NA values and empty strings.
missing_values <- colSums(is.na(accidentData)) + colSums(accidentData == "")
print(missing_values)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL
## 0 0 0 0 0 0
## PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## 0 0 0 0 0 0
## CITY CITYNAME DAY DAYNAME MONTH MONTHNAME
## 0 0 0 0 0 0
## YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE
## 0 0 0 0 0 0
## MINUTENAME NHS NHSNAME ROUTE ROUTENAME TWAY_ID
## 0 0 0 0 0 0
## TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## 26997 0 0 0 0 0
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE LATITUDENAME LONGITUD
## 0 0 0 0 0 0
## LONGITUDNAME SP_JUR SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## 0 0 0 0 0 0
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT
## 0 0 0 0 0 0
## TYP_INTNAME WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## 0 0 0 0 0 0
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL
## 0 0 0 0 0 0
## RAILNAME NOT_HOUR NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## 0 0 0 0 0 0
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR HOSP_HRNAME HOSP_MN
## 0 0 0 0 0 0
## HOSP_MNNAME FATALS DRUNK_DR
## 0 0 0
We observe that the variable TWAY_ID2 has 26,997 missing values (empty strings), so we will analyze this variable.
str(accidentData$TWAY_ID2)
## chr [1:35766] "" "" "us-280" "" "" "" "17TH ST" "" "" "" "cr-400" ...
missing_tway_id2 <- sum(is.na(accidentData$TWAY_ID2))
cat("Number of missing values in TWAY_ID2:", missing_tway_id2, "\n")
## Number of missing values in TWAY_ID2: 0
We understand that data for these roads is unavailable for some reason. However, in the emergency analysis presented later, it is essential to consider the road as an additional variable when interpreting the results, so we must account for the transformation of this variable. To address the issue, we fill the empty values with "notspecify" and verify that the procedure is applied correctly.
accidentData$TWAY_ID2 <- ifelse(accidentData$TWAY_ID2 == "", "notspecify", accidentData$TWAY_ID2)
missing_values <- colSums(is.na(accidentData)) + colSums(accidentData == "")
print(missing_values)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL
## 0 0 0 0 0 0
## PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME
## 0 0 0 0 0 0
## CITY CITYNAME DAY DAYNAME MONTH MONTHNAME
## 0 0 0 0 0 0
## YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE
## 0 0 0 0 0 0
## MINUTENAME NHS NHSNAME ROUTE ROUTENAME TWAY_ID
## 0 0 0 0 0 0
## TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS FUNC_SYSNAME RD_OWNER
## 0 0 0 0 0 0
## RD_OWNERNAME MILEPT MILEPTNAME LATITUDE LATITUDENAME LONGITUD
## 0 0 0 0 0 0
## LONGITUDNAME SP_JUR SP_JURNAME HARM_EV HARM_EVNAME MAN_COLL
## 0 0 0 0 0 0
## MAN_COLLNAME RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT
## 0 0 0 0 0 0
## TYP_INTNAME WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND
## 0 0 0 0 0 0
## LGT_CONDNAME WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL
## 0 0 0 0 0 0
## RAILNAME NOT_HOUR NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR
## 0 0 0 0 0 0
## ARR_HOURNAME ARR_MIN ARR_MINNAME HOSP_HR HOSP_HRNAME HOSP_MN
## 0 0 0 0 0 0
## HOSP_MNNAME FATALS DRUNK_DR
## 0 0 0
Before starting the analysis, we will focus on fatalities to set a global context and humanize the topic we are addressing, examining factors such as the most dangerous day of the week and the hour of occurrence.
total_fatalities <- sum(accidentData$FATALS, na.rm = TRUE)
cat("Total number of fatalities:", total_fatalities, "\n")
## Total number of fatalities: 38824
fatalities_by_day <- accidentData %>%
group_by(DAY_WEEKNAME) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)) %>%
arrange(desc(Total_Fatalities))
cat("\nFatalities by day of the week:\n")
##
## Fatalities by day of the week:
print(fatalities_by_day)
## # A tibble: 7 × 2
## DAY_WEEKNAME Total_Fatalities
## <chr> <int>
## 1 Saturday 6712
## 2 Sunday 6114
## 3 Friday 6026
## 4 Thursday 5221
## 5 Wednesday 5055
## 6 Tuesday 4858
## 7 Monday 4838
fatalities_by_hour <- accidentData %>%
group_by(HOURNAME) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)) %>%
arrange(desc(Total_Fatalities))
cat("\nFatalities by hour of the day:\n")
##
## Fatalities by hour of the day:
print(fatalities_by_hour)
## # A tibble: 25 × 2
## HOURNAME Total_Fatalities
## <chr> <int>
## 1 9:00pm-9:59pm 2357
## 2 6:00pm-6:59pm 2356
## 3 8:00pm-8:59pm 2343
## 4 7:00pm-7:59pm 2314
## 5 5:00pm-5:59pm 2133
## 6 10:00pm-10:59pm 2055
## 7 4:00pm-4:59pm 1997
## 8 3:00pm-3:59pm 1954
## 9 11:00pm-11:59pm 1879
## 10 2:00pm-2:59pm 1734
## # ℹ 15 more rows
It is always important to consider the margin of error; in this case, 313 records out of the total have an unknown hour.
unknown_hours_count <- sum(accidentData$HOURNAME == "Unknown Hours", na.rm = TRUE)
cat("Number of records with 'Unknown Hours':", unknown_hours_count, "\n")
## Number of records with 'Unknown Hours': 313
In this accident analysis, we found a total of 38,824 fatalities, which highlights a serious issue with road safety. Looking at the days of the week, Saturday is the most dangerous with 6,712 fatalities, followed closely by Sunday with 6,114 and Friday with 6,026. This suggests that weekends, when people tend to go out more, may be linked to a higher risk, possibly due to alcohol consumption.
Regarding the hours of the day, the most critical times are at night, with 9:00 PM to 9:59 PM recording the highest number of fatalities at 2,357, followed by 6:00 PM to 6:59 PM with 2,356 fatalities. This indicates that nighttime is particularly dangerous on the roads.
It is clear that action is needed to improve road safety, especially during weekends and these critical hours.
After analyzing and transforming the dataset, we will focus our analysis on the emergency-related variables.
EMERGENCY SERVICE DIMENSION
accidentData <- accidentData %>%
  mutate(
    # Minutes from EMS notification to arrival at the scene
    Tiempo_Respuesta = (ARR_HOUR * 60 + ARR_MIN) - (NOT_HOUR * 60 + NOT_MIN),
    # Minutes from arrival at the scene to arrival at the hospital
    Tiempo_Hasta_Hospital = (HOSP_HR * 60 + HOSP_MN) - (ARR_HOUR * 60 + ARR_MIN)
  )
head(accidentData %>% select(NOT_HOUR, NOT_MIN, ARR_HOUR, ARR_MIN, Tiempo_Respuesta))
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN Tiempo_Respuesta
## 1 99 99 3 10 -5849
## 2 17 18 17 26 8
## 3 14 58 15 15 17
## 4 99 99 99 99 0
## 5 0 45 0 55 10
## 6 17 0 17 19 19
We observe that these columns encode unknown values as 99, so we will replace those codes with NA.
accidentData <- accidentData %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR == 99, NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN == 99, NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR == 99, NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN == 99, NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR == 99, NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN == 99, NA, HOSP_MN)
)
hour_columns <- accidentData[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
na_summary_hours
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 19520 16888 19724 16967 15201 14967
na_percentage_hours <- (na_summary_hours / nrow(accidentData)) * 100
na_percentage_hours
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 54.57697 47.21803 55.14735 47.43891 42.50126 41.84701
The percentages are so high that any analysis of exact response times would be inaccurate. We need a different approach, such as analyzing time ranges; however, as we will see, even the time-range fields contain many unknown values.
incident_counts <- accidentData %>%
group_by(NOT_HOURNAME) %>%
summarise(Incident_Count = n())
ggplot(incident_counts, aes(x = NOT_HOURNAME, y = Incident_Count)) +
geom_bar(stat = "identity") +
labs(title = "Number of Incidents by Notification Time Range",
x = "Notification Time Range",
y = "Number of Incidents") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Count 'Unknown' labels; note that ARR_HOURNAME and ARR_MINNAME use longer labels
# ("Unknown EMS Scene Arrival Hour"/"... Minutes"), which explains the zeros below
unknown_counts <- accidentData %>%
  summarise(
    NOT_HOURNAME_Unknown = sum(NOT_HOURNAME == "Unknown", na.rm = TRUE),
    ARR_HOURNAME_Unknown = sum(ARR_HOURNAME == "Unknown", na.rm = TRUE),
    ARR_MINNAME_Unknown = sum(ARR_MINNAME == "Unknown", na.rm = TRUE)
  )
print(unknown_counts)
## NOT_HOURNAME_Unknown ARR_HOURNAME_Unknown ARR_MINNAME_Unknown
## 1 19520 0 0
accidentData <- accidentData %>%
select(-Tiempo_Respuesta, -Tiempo_Hasta_Hospital)
emergency_variables <- accidentData %>%
select(NOT_HOUR, NOT_MIN, ARR_HOUR, ARR_MIN, HOSP_HR, HOSP_MN)
summary_stats <- summary(emergency_variables)
na_counts <- sapply(emergency_variables, function(x) sum(is.na(x)))
summary_with_na <- list(
summary_stats = summary_stats,
na_counts = na_counts
)
print(summary_with_na)
## $summary_stats
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.:18.00 1st Qu.: 8.00 1st Qu.:18.00
## Median :15.00 Median :36.00 Median :15.00 Median :36.00
## Mean :16.21 Mean :41.13 Mean :16.25 Mean :41.42
## 3rd Qu.:19.00 3rd Qu.:54.00 3rd Qu.:19.00 3rd Qu.:54.00
## Max. :88.00 Max. :98.00 Max. :88.00 Max. :98.00
## NA's :19520 NA's :16888 NA's :19724 NA's :16967
## HOSP_HR HOSP_MN
## Min. : 0.00 Min. : 0.00
## 1st Qu.:18.00 1st Qu.:42.00
## Median :88.00 Median :88.00
## Mean :61.76 Mean :67.64
## 3rd Qu.:88.00 3rd Qu.:88.00
## Max. :88.00 Max. :98.00
## NA's :15201 NA's :14967
##
## $na_counts
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 19520 16888 19724 16967 15201 14967
Emergency Conclusion: There is a significant number of missing values across all variables, especially in NOT_HOUR, ARR_HOUR, and ARR_MIN, where more than 50% of the data is missing. This indicates that alternative variables should be used or that we should check if these data have improved in the 2022 dataset.
path2 <- 'accident2022.CSV'
accidentData2022 <- read.csv(path2, row.names = NULL)
str(accidentData2022)
## 'data.frame': 13047 obs. of 80 variables:
## $ STATE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ STATENAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ ST_CASE : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ PEDS : int 0 0 0 0 1 1 0 0 0 0 ...
## $ PERNOTMVIT : int 0 0 0 0 1 1 0 0 0 0 ...
## $ VE_TOTAL : int 2 2 1 1 1 1 2 1 2 1 ...
## $ VE_FORMS : int 2 2 1 1 1 1 2 1 2 1 ...
## $ PVH_INVL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PERSONS : int 3 5 2 1 1 5 1 1 3 1 ...
## $ PERMVIT : int 3 5 2 1 1 5 1 1 3 1 ...
## $ COUNTY : int 107 101 115 101 73 101 63 101 71 131 ...
## $ COUNTYNAME : chr "PICKENS (107)" "MONTGOMERY (101)" "ST. CLAIR (115)" "MONTGOMERY (101)" ...
## $ CITY : int 0 0 0 0 0 2130 0 0 0 0 ...
## $ CITYNAME : chr "NOT APPLICABLE" "NOT APPLICABLE" "NOT APPLICABLE" "NOT APPLICABLE" ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MONTHNAME : chr "January" "January" "January" "January" ...
## $ DAY : int 1 1 1 2 2 2 4 4 4 5 ...
## $ DAYNAME : int 1 1 1 2 2 2 4 4 4 5 ...
## $ DAY_WEEK : int 7 7 7 1 1 1 3 3 3 4 ...
## $ DAY_WEEKNAME: chr "Saturday" "Saturday" "Saturday" "Sunday" ...
## $ YEAR : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ HOUR : int 12 16 1 14 18 18 9 14 11 0 ...
## $ HOURNAME : chr "12:00pm-12:59pm" "4:00pm-4:59pm" "1:00am-1:59am" "2:00pm-2:59pm" ...
## $ MINUTE : int 30 40 33 46 48 28 5 50 40 0 ...
## $ MINUTENAME : chr "30" "40" "33" "46" ...
## $ TWAY_ID : chr "US-82 SR-6" "US-231 SR-53" "CR-KELLY CREEK RD" "I-65" ...
## $ TWAY_ID2 : chr "" "" "" "" ...
## $ ROUTE : int 2 2 4 1 1 6 4 4 2 3 ...
## $ ROUTENAME : chr "US Highway" "US Highway" "County Road" "Interstate" ...
## $ RUR_URB : int 1 1 1 1 2 2 1 1 1 1 ...
## $ RUR_URBNAME : chr "Rural" "Rural" "Rural" "Rural" ...
## $ FUNC_SYS : int 3 3 5 1 1 4 5 5 3 3 ...
## $ FUNC_SYSNAME: chr "Principal Arterial - Other" "Principal Arterial - Other" "Major Collector" "Interstate" ...
## $ RD_OWNER : int 1 1 2 1 1 4 2 2 1 1 ...
## $ RD_OWNERNAME: chr "State Highway Agency" "State Highway Agency" "County Highway Agency" "State Highway Agency" ...
## $ NHS : int 1 1 0 1 1 0 0 0 1 1 ...
## $ NHSNAME : chr "This section IS ON the NHS" "This section IS ON the NHS" "This section IS NOT on the NHS" "This section IS ON the NHS" ...
## $ SP_JUR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SP_JURNAME : chr "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" "No Special Jurisdiction" ...
## $ MILEPT : int 4 974 0 1595 1342 0 0 0 1243 52 ...
## $ MILEPTNAME : chr "4" "974" "None" "1595" ...
## $ LATITUDE : num 33.5 32.1 33.4 32.2 33.5 ...
## $ LATITUDENAME: num 33.5 32.1 33.4 32.2 33.5 ...
## $ LONGITUD : num -88.3 -86.1 -86.4 -86.4 -86.7 ...
## $ LONGITUDNAME: num -88.3 -86.1 -86.4 -86.4 -86.7 ...
## $ HARM_EV : int 12 12 42 34 8 8 12 38 12 42 ...
## $ HARM_EVNAME : chr "Motor Vehicle In-Transport" "Motor Vehicle In-Transport" "Tree (Standing Only)" "Ditch" ...
## $ MAN_COLL : int 7 2 0 0 0 0 1 0 6 0 ...
## $ MAN_COLLNAME: chr "Sideswipe - Same Direction" "Front-to-Front" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" ...
## $ RELJCT1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RELJCT1NAME : chr "No" "No" "No" "No" ...
## $ RELJCT2 : int 1 1 1 1 1 1 1 1 2 1 ...
## $ RELJCT2NAME : chr "Non-Junction" "Non-Junction" "Non-Junction" "Non-Junction" ...
## $ TYP_INT : int 1 1 1 1 1 1 1 1 2 1 ...
## $ TYP_INTNAME : chr "Not an Intersection" "Not an Intersection" "Not an Intersection" "Not an Intersection" ...
## $ REL_ROAD : int 1 1 4 4 2 1 1 4 1 4 ...
## $ REL_ROADNAME: chr "On Roadway" "On Roadway" "On Roadside" "On Roadside" ...
## $ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WRK_ZONENAME: chr "None" "None" "None" "None" ...
## $ LGT_COND : int 1 1 2 1 2 3 1 1 1 2 ...
## $ LGT_CONDNAME: chr "Daylight" "Daylight" "Dark - Not Lighted" "Daylight" ...
## $ WEATHER : int 1 1 10 10 2 1 1 1 1 1 ...
## $ WEATHERNAME : chr "Clear" "Clear" "Cloudy" "Cloudy" ...
## $ SCH_BUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SCH_BUSNAME : chr "No" "No" "No" "No" ...
## $ RAIL : chr "0000000" "0000000" "0000000" "0000000" ...
## $ RAILNAME : chr "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ NOT_HOUR : int 12 99 1 14 18 18 99 99 11 0 ...
## $ NOT_HOURNAME: chr "12:00pm-12:59pm" "Unknown" "1:00am-1:59am" "2:00pm-2:59pm" ...
## $ NOT_MIN : int 47 99 33 48 48 26 99 99 36 0 ...
## $ NOT_MINNAME : chr "47" "Unknown" "33" "48" ...
## $ ARR_HOUR : int 13 99 1 15 18 18 99 99 11 0 ...
## $ ARR_HOURNAME: chr "1:00pm-1:59pm" "Unknown EMS Scene Arrival Hour" "1:00am-1:59am" "3:00pm-3:59pm" ...
## $ ARR_MIN : int 4 99 50 9 54 32 99 99 54 33 ...
## $ ARR_MINNAME : chr "4" "Unknown EMS Scene Arrival Minutes" "50" "9" ...
## $ HOSP_HR : int 13 99 99 15 88 99 88 99 12 88 ...
## $ HOSP_HRNAME : chr "1:00pm-1:59pm" "Unknown" "Unknown" "3:00pm-3:59pm" ...
## $ HOSP_MN : int 47 99 99 44 88 99 88 99 41 88 ...
## $ HOSP_MNNAME : chr "47" "Unknown EMS Hospital Arrival Time" "Unknown EMS Hospital Arrival Time" "44" ...
## $ FATALS : int 1 2 1 1 1 1 1 1 1 1 ...
accidentData2022 <- accidentData2022 %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR == 99, NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN == 99, NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR == 99, NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN == 99, NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR == 99, NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN == 99, NA, HOSP_MN)
)
hour_columns <- accidentData2022[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
print(na_summary_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 7165 7723 7481 7973 5787 5752
na_percentage_hours <- (na_summary_hours / nrow(accidentData2022)) * 100
print(na_percentage_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 54.91684 59.19368 57.33885 61.10983 44.35502 44.08676
During our data exploration and upon revisiting the documentation, we found that 88 corresponds to “Not Applicable or Not Notified”. We will include this value in the percentage calculations.
accidentData2022 <- accidentData2022 %>%
mutate(
NOT_HOUR = ifelse(NOT_HOUR %in% c(99, 88), NA, NOT_HOUR),
NOT_MIN = ifelse(NOT_MIN %in% c(99, 88), NA, NOT_MIN),
ARR_HOUR = ifelse(ARR_HOUR %in% c(99, 88), NA, ARR_HOUR),
ARR_MIN = ifelse(ARR_MIN %in% c(99, 88), NA, ARR_MIN),
HOSP_HR = ifelse(HOSP_HR %in% c(99, 88), NA, HOSP_HR),
HOSP_MN = ifelse(HOSP_MN %in% c(99, 88), NA, HOSP_MN)
)
hour_columns <- accidentData2022[, c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")]
na_summary_hours <- colSums(is.na(hour_columns))
print(na_summary_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 7224 7782 7540 8032 11011 10976
na_percentage_hours <- (na_summary_hours / nrow(accidentData2022)) * 100
print(na_percentage_hours)
## NOT_HOUR NOT_MIN ARR_HOUR ARR_MIN HOSP_HR HOSP_MN
## 55.36905 59.64590 57.79106 61.56204 84.39488 84.12662
The percentage values remain very high, even when using a dataset from a more recent year. This indicates that the focus should be on how these values are recorded and on handling the missing information, as there are many unknown values.
An attempt is made to perform multiple imputation based on an external reference. We begin the process by applying the technique described there, focusing on the relevant columns: first, we replace the values 99 and 88, which are considered unknown data, with NA to facilitate imputation.
We use the mice function to perform the imputation, setting the number of imputations to 30. This ensures that the missing data in the notification and emergency arrival columns are adequately addressed.
Finally, visualizations are generated to analyze the relationships between notification and arrival times, allowing for a deeper examination of the efficiency of the emergency response system.
columns <- c("NOT_HOUR", "NOT_MIN", "ARR_HOUR", "ARR_MIN", "HOSP_HR", "HOSP_MN")
accidentData <- accidentData %>%
mutate(across(all_of(columns), ~ replace(., . %in% c(99, 88), NA)))
set.seed(2018)
imputed_data3 <- mice(accidentData[, names(accidentData) %in% columns], m = 30, print = FALSE)
complete.data3 <- mice::complete(imputed_data3)
xyplot(imputed_data3, ARR_HOUR ~ NOT_HOUR)
We create a new dataset with the imputed values (the completed data were already extracted above).
# Overwrite the original columns with their imputed versions
for (column in columns) {
  accidentData[[column]] <- complete.data3[[column]]
}
accidentData_imputed <- accidentData
head(accidentData_imputed)
## STATE STATENAME ST_CASE VE_TOTAL VE_FORMS PVH_INVL PEDS PERSONS PERMVIT
## 1 1 Alabama 10001 1 1 0 0 4 4
## 2 1 Alabama 10002 4 4 0 0 6 6
## 3 1 Alabama 10003 2 2 0 0 2 2
## 4 1 Alabama 10004 1 1 0 0 5 5
## 5 1 Alabama 10005 1 1 0 0 1 1
## 6 1 Alabama 10006 2 2 0 0 3 3
## PERNOTMVIT COUNTY COUNTYNAME CITY CITYNAME DAY DAYNAME MONTH
## 1 0 51 ELMORE (51) 0 NOT APPLICABLE 1 1 1
## 2 0 73 JEFFERSON (73) 350 BIRMINGHAM 2 2 1
## 3 0 117 SHELBY (117) 0 NOT APPLICABLE 2 2 1
## 4 0 15 CALHOUN (15) 0 NOT APPLICABLE 3 3 1
## 5 0 37 COOSA (37) 0 NOT APPLICABLE 4 4 1
## 6 0 103 MORGAN (103) 0 NOT APPLICABLE 4 4 1
## MONTHNAME YEAR DAY_WEEK DAY_WEEKNAME HOUR HOURNAME MINUTE MINUTENAME NHS
## 1 January 2020 4 Wednesday 2 2:00am-2:59am 58 58 0
## 2 January 2020 5 Thursday 17 5:00pm-5:59pm 18 18 0
## 3 January 2020 5 Thursday 14 2:00pm-2:59pm 55 55 0
## 4 January 2020 6 Friday 15 3:00pm-3:59pm 20 20 0
## 5 January 2020 7 Saturday 0 0:00am-0:59am 45 45 0
## 6 January 2020 7 Saturday 16 4:00pm-4:59pm 55 55 0
## NHSNAME ROUTE ROUTENAME
## 1 This section IS NOT on the NHS 4 County Road
## 2 This section IS NOT on the NHS 6 Local Street - Municipality
## 3 This section IS NOT on the NHS 3 State Highway
## 4 This section IS NOT on the NHS 4 County Road
## 5 This section IS NOT on the NHS 4 County Road
## 6 This section IS NOT on the NHS 3 State Highway
## TWAY_ID TWAY_ID2 RUR_URB RUR_URBNAME FUNC_SYS
## 1 cr-4 notspecify 1 Rural 5
## 2 martin luther king jr dr notspecify 2 Urban 4
## 3 sr-76 us-280 1 Rural 4
## 4 CR-ALEXANDRIA WELLINGTON RD notspecify 1 Rural 7
## 5 CR-63 notspecify 1 Rural 5
## 6 sr-36 notspecify 1 Rural 4
## FUNC_SYSNAME RD_OWNER RD_OWNERNAME MILEPT MILEPTNAME
## 1 Major Collector 2 County Highway Agency 0 None
## 2 Minor Arterial 4 City or Municipal Highway Agency 0 None
## 3 Minor Arterial 1 State Highway Agency 49 49
## 4 Local 2 County Highway Agency 0 None
## 5 Major Collector 2 County Highway Agency 0 None
## 6 Minor Arterial 1 State Highway Agency 390 390
## LATITUDE LATITUDENAME LONGITUD LONGITUDNAME SP_JUR SP_JURNAME
## 1 32.43313 32.43313333 -86.09485 -86.09485 0 No Special Jurisdiction
## 2 33.48466 33.48465833 -86.83954 -86.83954444 0 No Special Jurisdiction
## 3 33.29994 33.29994167 -86.36964 -86.36964167 0 No Special Jurisdiction
## 4 33.79507 33.79507222 -85.88349 -85.88348611 0 No Special Jurisdiction
## 5 32.84841 32.84841389 -86.08355 -86.08354722 0 No Special Jurisdiction
## 6 34.50894 34.50894167 -86.67486 -86.67485556 0 No Special Jurisdiction
## HARM_EV HARM_EVNAME MAN_COLL
## 1 42 Tree (Standing Only) 0
## 2 12 Motor Vehicle In-Transport 6
## 3 34 Ditch 0
## 4 42 Tree (Standing Only) 0
## 5 42 Tree (Standing Only) 0
## 6 12 Motor Vehicle In-Transport 2
## MAN_COLLNAME
## 1 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 2 Angle
## 3 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 4 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 5 The First Harmful Event was Not a Collision with a Motor Vehicle in Transport
## 6 Front-to-Front
## RELJCT1 RELJCT1NAME RELJCT2 RELJCT2NAME TYP_INT TYP_INTNAME
## 1 0 No 1 Non-Junction 1 Not an Intersection
## 2 0 No 1 Non-Junction 1 Not an Intersection
## 3 0 No 3 Intersection-Related 3 T-Intersection
## 4 0 No 1 Non-Junction 1 Not an Intersection
## 5 0 No 1 Non-Junction 1 Not an Intersection
## 6 0 No 1 Non-Junction 1 Not an Intersection
## WRK_ZONE WRK_ZONENAME REL_ROAD REL_ROADNAME LGT_COND LGT_CONDNAME
## 1 0 None 4 On Roadside 2 Dark - Not Lighted
## 2 0 None 1 On Roadway 3 Dark - Lighted
## 3 0 None 4 On Roadside 1 Daylight
## 4 0 None 4 On Roadside 1 Daylight
## 5 0 None 4 On Roadside 2 Dark - Not Lighted
## 6 0 None 1 On Roadway 2 Dark - Not Lighted
## WEATHER WEATHERNAME SCH_BUS SCH_BUSNAME RAIL RAILNAME NOT_HOUR
## 1 1 Clear 0 No 0000000 Not Applicable 3
## 2 2 Rain 0 No 0000000 Not Applicable 17
## 3 2 Rain 0 No 0000000 Not Applicable 14
## 4 10 Cloudy 0 No 0000000 Not Applicable 18
## 5 2 Rain 0 No 0000000 Not Applicable 0
## 6 1 Clear 0 No 0000000 Not Applicable 17
## NOT_HOURNAME NOT_MIN NOT_MINNAME ARR_HOUR ARR_HOURNAME
## 1 Unknown 12 Unknown 3 3:00am-3:59am
## 2 5:00pm-5:59pm 18 18 17 5:00pm-5:59pm
## 3 2:00pm-2:59pm 58 58 15 3:00pm-3:59pm
## 4 Unknown 43 Unknown 18 Unknown EMS Scene Arrival Hour
## 5 0:00am-0:59am 45 45 0 0:00am-0:59am
## 6 5:00pm-5:59pm 0 0 17 5:00pm-5:59pm
## ARR_MIN ARR_MINNAME HOSP_HR
## 1 10 10 3
## 2 26 26 17
## 3 15 15 15
## 4 47 Unknown EMS Scene Arrival Minutes 19
## 5 55 55 2
## 6 19 19 18
## HOSP_HRNAME HOSP_MN HOSP_MNNAME
## 1 Unknown 47 Unknown EMS Hospital Arrival Time
## 2 Unknown 12 Unknown EMS Hospital Arrival Time
## 3 Unknown 21 Unknown EMS Hospital Arrival Time
## 4 Unknown 22 Unknown EMS Hospital Arrival Time
## 5 Not Applicable (Not Transported) 8 Not Applicable (Not Transported)
## 6 6:00pm-6:59pm 51 51
## FATALS DRUNK_DR
## 1 3 1
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
The histogram below shows the distribution of the emergency notification hour (NOT_HOUR).
Next, we tabulate the frequencies to confirm (although it is already visible in the graph) that 8:00 PM is the hour with the highest number of notifications.
ggplot(accidentData_imputed, aes(x = NOT_HOUR)) +
geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Notification Hour",
x = "Notification Hour",
y = "Frequency")
frecuencia_not_hour <- table(accidentData_imputed$NOT_HOUR)
frecuencia_not_hour_df <- as.data.frame(frecuencia_not_hour)
names(frecuencia_not_hour_df) <- c("Hora", "Frecuencia")
frecuencia_maxima <- frecuencia_not_hour_df[which.max(frecuencia_not_hour_df$Frecuencia), ]
frecuencia_maxima
## Hora Frecuencia
## 21 20 2333
Now, we create another graph to visualize fatalities by notification hour.
At first glance, we can confirm that the hour with the most notifications (8:00 PM) is also the one with the highest number of fatalities.
ggplot(accidentData_imputed %>%
group_by(NOT_HOUR) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE)),
aes(x = NOT_HOUR, y = Total_Fatalities)) +
geom_bar(stat = "identity", fill = "red", alpha = 0.7) +
labs(title = "Total Fatalities by Notification Hour",
x = "Notification Hour",
y = "Total Fatalities") +
theme_minimal()
Now, we will create the same graph but based on time ranges.
Although we could anticipate that the number of fatalities is higher in the afternoon, this plot provides a more complete visualization, suggesting that the later it gets, the higher the number of fatalities.
accidentData_imputed <- accidentData_imputed %>%
  mutate(Time_Range = case_when(
    NOT_HOUR >= 0 & NOT_HOUR < 6 ~ "00-06",
    NOT_HOUR >= 6 & NOT_HOUR < 12 ~ "06-12",
    NOT_HOUR >= 12 & NOT_HOUR < 18 ~ "12-18",
    TRUE ~ "18-24"  # remaining hours (18:00-23:59)
  ))
fatalities_by_time_range <- accidentData_imputed %>%
group_by(Time_Range) %>%
summarise(Total_Fatalities = sum(FATALS, na.rm = TRUE))
ggplot(fatalities_by_time_range, aes(x = Time_Range, y = Total_Fatalities)) +
geom_bar(stat = "identity", fill = "purple", alpha = 0.7) +
labs(title = "Total Fatalities by Time Range",
x = "Time Range",
y = "Total Fatalities") +
theme_minimal()
We perform a correlation analysis as confirmation and find a strong positive relationship between the notification hour and the number of fatalities (r ≈ 0.83).
muertes_por_hora <- accidentData_imputed %>%
group_by(NOT_HOUR) %>%
summarise(Total_Muertes = sum(FATALS, na.rm = TRUE))
correlacion <- cor(muertes_por_hora$NOT_HOUR, muertes_por_hora$Total_Muertes, use = "complete.obs")
print(correlacion)
## [1] 0.8305662
We create a Poisson regression model to study how the notification hour (NOT_HOUR) relates to the total number of fatalities (Total_Muertes).
The results indicate that fatalities tend to increase significantly as the day progresses, with a coefficient of 0.0347 per hour on the log scale.
modelo <- glm(Total_Muertes ~ NOT_HOUR, data = muertes_por_hora, family = "poisson")
summary(modelo)
##
## Call:
## glm(formula = Total_Muertes ~ NOT_HOUR, family = "poisson", data = muertes_por_hora)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.9608730 0.0110399 630.52 <2e-16 ***
## NOT_HOUR 0.0347102 0.0007459 46.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3206.6 on 23 degrees of freedom
## Residual deviance: 1003.5 on 22 degrees of freedom
## AIC: 1227.9
##
## Number of Fisher Scoring iterations: 4
ggplot(muertes_por_hora, aes(x = NOT_HOUR, y = Total_Muertes)) +
geom_point(color = "blue") +
geom_smooth(method = "glm", method.args = list(family = "poisson"), color = "red") +
labs(title = "Relationship Between Notification Hour and Total Fatalities",
x = "Time of NotificaciĂłn",
y = "Total of deaths") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The analysis of the relationship between the emergency notification hour (NOT_HOUR) and the number of fatalities provides insight into how these situations unfold. Using the regression model, we found a significant connection: for each additional hour, the expected number of fatalities increases by approximately 3.53%. This result is statistically strong (p-value less than 2e-16), indicating that it is very unlikely to be due to chance.
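The 3.53% figure follows directly from exponentiating the Poisson coefficient:
# Rate ratio per additional hour implied by the Poisson coefficient
exp(0.0347102)              # ~1.0353: each hour multiplies the expected count by ~1.035
(exp(0.0347102) - 1) * 100  # ~3.53% increase in expected fatalities per hour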
Examining the data, we observe a cyclical pattern in the hours of the day. For example, the number of fatalities at 11:00 PM and 12:00 AM is similar, as both hours are part of the same daily transition. This finding highlights that, even as the day changes, emergency activity continues.
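This similarity can be checked directly on the aggregated table built earlier; the exact figures depend on the imputed data.
# Fatalities notified at 23:00 versus 00:00
muertes_por_hora %>% dplyr::filter(NOT_HOUR %in% c(23, 0))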
The majority of fatalities occur between 6:00 PM and 12:00 AM, which may be related to increased public and private activity during those hours. This suggests that emergency services should be prepared for a higher number of incidents during these critical periods.
To improve emergency response, it is essential to adjust resources and staffing based on these demand peaks. Increasing the number of ambulances and medical personnel during these key hours could make a significant difference in response times and potentially reduce the number of fatalities.
In this analysis, we examine the weather-related variables to identify potentially unrecorded values in the accident data. We observed that only 0.73% of the records are classified as "Reported as Unknown", a percentage so low that it is considered insignificant and will not affect the validity of our analysis.
total_registros <- nrow(accidentData)
clima_frecuencia <- accidentData %>%
group_by(WEATHERNAME) %>%
summarise(Frecuencia = n()) %>%
mutate(Porcentaje = (Frecuencia / total_registros) * 100) %>%
arrange(desc(Frecuencia))
print(clima_frecuencia)
## # A tibble: 13 × 3
## WEATHERNAME Frecuencia Porcentaje
## <chr> <int> <dbl>
## 1 Clear 24963 69.8
## 2 Cloudy 4622 12.9
## 3 Rain 2634 7.36
## 4 Not Reported 2461 6.88
## 5 Fog, Smog, Smoke 370 1.03
## 6 Snow 283 0.791
## 7 Reported as Unknown 261 0.730
## 8 Severe Crosswinds 56 0.157
## 9 Freezing Rain or Drizzle 39 0.109
## 10 Blowing Snow 26 0.0727
## 11 Sleet or Hail 26 0.0727
## 12 Other 20 0.0559
## 13 Blowing Sand, Soil, Dirt 5 0.0140
ggplot(clima_frecuencia, aes(x = reorder(WEATHERNAME, -Frecuencia), y = Frecuencia)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Accident Frequency by Weather Conditions",
x = "Weather Conditions",
y = "Accident Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To evaluate the frequency of the different weather conditions, we counted the observations in each category of the WEATHERNAME variable; the resulting frequencies and associated percentages are shown above.
Observations
The data show that the majority of accidents (69.8%) occurred under clear conditions, which suggests that weather conditions are not always a determining factor in accident severity. However, rainy and cloudy conditions also constitute a significant portion of the records, indicating the need for further investigation into the relationship between these conditions and accident severity.
The presence of a considerable percentage of records classified as “Not Reported” (6.88%) also highlights the importance of improving data collection regarding weather conditions in accident reports.
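As a sketch of that further investigation, we could compare severity across weather conditions through the mean number of fatalities per accident, using the same accidentData frame as above:
# Mean number of fatalities per accident under each weather condition
accidentData %>%
  group_by(WEATHERNAME) %>%
  summarise(Accidents = n(),
            Mean_Fatalities = mean(FATALS, na.rm = TRUE)) %>%
  arrange(desc(Mean_Fatalities))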
At this point, we proceed to normalize the variables. As demonstrated in the example exercise, we use the following code to normalize the variables FATALS and PERNOTMVIT, scaling all values to a common range between 0 and 1 to facilitate comparison and further analysis, and to one-hot encode the categorical variable WEATHERNAME into dummy variables so that it can enter the subsequent analysis.
path <- 'accident.CSV'
accidentData <- read.csv(path, row.names = NULL)
# Keep only the variables involved in this analysis
accidentData <- accidentData %>%
  select(FATALS, PERNOTMVIT, WEATHERNAME)
# Min-max normalization to the [0, 1] range
nor <- function(x) {(x - min(x)) / (max(x) - min(x))}
accidentData_nor <- accidentData %>%
  mutate(FATALS = nor(FATALS), PERNOTMVIT = nor(PERNOTMVIT))
# One-hot encode WEATHERNAME: one dummy column per weather category
accidentData_dummies <- accidentData_nor %>%
  bind_cols(model.matrix(~WEATHERNAME - 1, data = .)) %>%
  select(-WEATHERNAME)
head(accidentData_dummies)
## FATALS PERNOTMVIT WEATHERNAMEBlowing Sand, Soil, Dirt
## 1 0.2857143 0 0
## 2 0.0000000 0 0
## 3 0.0000000 0 0
## 4 0.0000000 0 0
## 5 0.0000000 0 0
## 6 0.0000000 0 0
## WEATHERNAMEBlowing Snow WEATHERNAMEClear WEATHERNAMECloudy
## 1 0 1 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 0 0 0
## 6 0 1 0
## WEATHERNAMEFog, Smog, Smoke WEATHERNAMEFreezing Rain or Drizzle
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## WEATHERNAMENot Reported WEATHERNAMEOther WEATHERNAMERain
## 1 0 0 0
## 2 0 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## WEATHERNAMEReported as Unknown WEATHERNAMESevere Crosswinds
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## WEATHERNAMESleet or Hail WEATHERNAMESnow
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
We perform a Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and explore its underlying structure.
The code executes the prcomp function, centering and scaling the data, which allows the variables to be analyzed on the same scale. Then, we summarize the results to obtain information on the variance explained by each principal component.
The variance proportion indicates how much information each component retains, facilitating the identification of the most relevant components.
Finally, we generate a scree plot of the explained variance to visualize how it is distributed across the principal components.
# PCA on the normalized, one-hot-encoded data, centering and scaling all variables
pca.acc <- prcomp(accidentData_dummies, center = TRUE, scale. = TRUE)
summary(pca.acc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3313 1.05582 1.0400 1.03116 1.00778 1.00487 1.00262
## Proportion of Variance 0.1182 0.07432 0.0721 0.07089 0.06771 0.06732 0.06702
## Cumulative Proportion 0.1182 0.19247 0.2646 0.33546 0.40317 0.47049 0.53751
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00125 1.00055 1.00040 1.00036 1.00011 0.99842 0.96708
## Proportion of Variance 0.06683 0.06674 0.06672 0.06671 0.06668 0.06646 0.06235
## Cumulative Proportion 0.60434 0.67108 0.73780 0.80451 0.87119 0.93765 1.00000
## PC15
## Standard deviation 1.802e-13
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
library(factoextra)
# Eigenvalues and scree plot of the principal components
ev <- get_eig(pca.acc)
fviz_eig(pca.acc)
# Component variances (eigenvalues)
var_acc <- pca.acc$sdev^2
head(var_acc)
## [1] 1.772332 1.114761 1.081545 1.063291 1.015628 1.009763
# Kaiser criterion: retain components with eigenvalue greater than 1
num_components <- sum(var_acc > 1)
# Loadings (coordinates) of the variables on the retained components
var <- get_pca_var(pca.acc)
head(var$coord[, 1:num_components], 11)
## Dim.1 Dim.2 Dim.3
## FATALS 0.02188424 -0.035643784 -0.0780970227
## PERNOTMVIT -0.06739766 -0.045805788 0.3018008689
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.01547945 0.008151962 0.0004163935
## WEATHERNAMEBlowing Snow 0.03573920 0.020545567 -0.0162880978
## WEATHERNAMEClear -0.99765490 -0.045559954 -0.0089981179
## WEATHERNAMECloudy 0.62580775 -0.767977195 0.0210824022
## WEATHERNAMEFog, Smog, Smoke 0.13626324 0.065226551 -0.0162136833
## WEATHERNAMEFreezing Rain or Drizzle 0.04400598 0.021799281 -0.0264352132
## WEATHERNAMENot Reported 0.39575512 0.480776081 -0.6953529818
## WEATHERNAMEOther 0.03096916 0.013251108 0.0030929785
## WEATHERNAMERain 0.41168889 0.524753223 0.7011149348
## Dim.4 Dim.5 Dim.6
## FATALS -0.68881258 0.09145551 0.097678349
## PERNOTMVIT 0.61401332 -0.06653455 0.054083991
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.02787652 -0.04782031 -0.044492080
## WEATHERNAMEBlowing Snow 0.01886354 -0.09276140 -0.135728459
## WEATHERNAMEClear -0.02709641 0.03094815 0.004623439
## WEATHERNAMECloudy 0.03948947 0.11889687 0.019350438
## WEATHERNAMEFog, Smog, Smoke -0.08326557 -0.72437432 0.646646403
## WEATHERNAMEFreezing Rain or Drizzle -0.07790375 -0.05813001 -0.056411678
## WEATHERNAMENot Reported 0.25588567 0.19516531 0.069410173
## WEATHERNAMEOther 0.00370126 -0.06696078 -0.005676920
## WEATHERNAMERain -0.13331996 0.18854710 0.020630338
## Dim.7 Dim.8 Dim.9
## FATALS -0.144531314 -0.0798447092 -0.021804416
## PERNOTMVIT 0.048151167 -0.1020214391 0.003238107
## WEATHERNAMEBlowing Sand, Soil, Dirt 0.081663587 0.1431814086 0.084454508
## WEATHERNAMEBlowing Snow 0.161342725 0.7024325901 0.457615332
## WEATHERNAMEClear -0.003044417 0.0006871405 -0.001197626
## WEATHERNAMECloudy 0.004533127 -0.0046146431 -0.002785924
## WEATHERNAMEFog, Smog, Smoke 0.149256576 -0.0070618034 -0.002614266
## WEATHERNAMEFreezing Rain or Drizzle -0.118747827 0.4573513883 -0.774488240
## WEATHERNAMENot Reported -0.013993173 -0.0508926548 -0.007938861
## WEATHERNAMEOther -0.010150974 -0.0632117856 0.035332462
## WEATHERNAMERain 0.002423523 0.0054484870 -0.003969482
## Dim.10 Dim.11 Dim.12
## FATALS 0.012718192 0.0009081905 0.019795339
## PERNOTMVIT -0.027968519 -0.0219824936 -0.012842878
## WEATHERNAMEBlowing Sand, Soil, Dirt -0.101496072 -0.0781297273 -0.947091457
## WEATHERNAMEBlowing Snow -0.326791617 0.2086704105 0.230015976
## WEATHERNAMEClear 0.003474573 0.0037847794 0.001921345
## WEATHERNAMECloudy 0.009312143 0.0121413165 0.004556120
## WEATHERNAMEFog, Smog, Smoke 0.028255107 0.0448464866 0.008430521
## WEATHERNAMEFreezing Rain or Drizzle -0.249976703 0.0321594655 -0.048551730
## WEATHERNAMENot Reported 0.008981734 0.0119421568 0.007537403
## WEATHERNAMEOther -0.470447865 -0.8573276481 0.142215880
## WEATHERNAMERain 0.015634162 0.0189832748 0.007401653
# Proportion of total variance explained by each component
var_acc <- pca.acc$sdev^2
proporciones_varianza <- var_acc / sum(var_acc)
scree_data <- data.frame(
  Component = 1:length(var_acc),
  Variance = proporciones_varianza
)
# Scree plot of the proportion of explained variance per component
ggplot(scree_data, aes(x = Component, y = Variance)) +
  geom_line() +
  geom_point() +
  labs(title = "Scree Plot",
       x = "Component Number",
       y = "Proportion of Explained Variance") +
  theme_minimal()
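Beyond the Kaiser criterion applied above, a common complementary rule (sketched here as an addition, with 80% as an assumed threshold) is to keep enough components to reach a cumulative-variance target:
# Sketch: smallest number of components reaching an assumed 80% threshold
cum_var <- cumsum(proporciones_varianza)
which(cum_var >= 0.80)[1]
# Per the summary above, the cumulative proportion first exceeds 0.80 at PC11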
PCA Analysis
The Principal Component Analysis (PCA) applied to the accident data shows that the first two principal components explain only about 19% of the total variance, so the information is spread across many components rather than concentrated in a low-dimensional subspace. The first six components (PC1 to PC6) together account for just over 47% of the total variance. Note also that the last component (PC15) has essentially zero variance: the one-hot dummy columns are linearly dependent, since they sum to one for every record.
Notably, the first principal component (PC1) has a high negative loading on WEATHERNAMEClear, suggesting that under clear weather conditions, more fatalities occur in accidents. The second component (PC2) shows a positive relationship with PERNOTMVIT and WEATHERNAMERain, indicating that rain may increase accident severity. Meanwhile, the third component (PC3) presents a complex combination of WEATHERNAMEFog, Smog, Smoke, and WEATHERNAMERain, suggesting that adverse weather conditions may be associated with an increased number of fatal accidents.
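These loadings are easier to inspect on a variable map; one possible visualization with factoextra (not part of the original output) is:
# Sketch: correlation circle of the variables on PC1-PC2,
# coloring each variable by its contribution to these components
fviz_pca_var(pca.acc, col.var = "contrib", repel = TRUE)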
The preliminary analysis, which shows that 69.8% of the recorded accidents occurred under clear weather conditions, has important implications for road safety and driver behavior. It highlights that, although clear conditions are considered safe for driving, accidents still occur frequently in them. Many road users, including non-motorized ones, may assume that good weather guarantees greater safety, potentially leading to riskier behaviors such as speeding or inattention to road hazards.
On the other hand, the fact that adverse weather conditions such as fog, smoke, or snow account for less than 2% of recorded accidents suggests that, although rare, these situations may significantly impact accident severity. It is possible that drivers are more cautious in these conditions, resulting in fewer accidents overall. However, when they do occur, these accidents tend to be more severe due to factors such as reduced visibility and loss of vehicle control.
The PCA performed on the normalized dataset yielded interesting results regarding the relationship between weather conditions and accidents. As noted above, the first six principal components (PC1 to PC6) explain just over 47% of the total variance, meaning that a relatively small number of components capture nearly half of the information in the dataset.
The first component (PC1) is notable for its high negative loading on clear weather conditions (WEATHERNAMEClear). This suggests that, contrary to expectations, good weather conditions are associated with a higher number of fatalities in accidents. This finding is surprising, as good weather is typically linked to safer driving conditions.
The second component (PC2) shows a positive relationship with the number of persons not in motor vehicles (PERNOTMVIT) and with rain conditions (WEATHERNAMERain). This indicates that rain can contribute to more severe accidents, emphasizing the need for drivers to exercise greater caution in such conditions.
Finally, the third component (PC3) presents a combination of complex weather conditions, such as fog, smog, and smoke, along with rain. This suggests that when weather conditions worsen, accidents become more hazardous. These results underline the importance of road safety awareness, especially on days with adverse meteorological conditions where visibility and vehicle control may be compromised.
To conclude, the fact that most fatal accidents occur on clear days may be linked to increased traffic volume. When the weather is good, more people tend to drive, increasing the likelihood of collisions. Additionally, favorable weather may lead some drivers to become overconfident, resulting in risky behaviors such as speeding, under the false assumption that they are safer.
Conversely, accidents that occur in the rain tend to be more severe. This may be because rain makes roads slippery and reduces visibility; braking distances grow and vehicles can skid more easily, leading to hazardous situations. The combination of rain and high speed is particularly risky, especially when drivers fail to adjust their behavior to the conditions.
Regarding situations such as fog or snow, although they occur less frequently, when they do happen, the impact can be more severe. These conditions drastically reduce visibility and may lead to multi-vehicle accidents, particularly on congested or high-speed roads.
In general, it is crucial for all drivers to take weather conditions into account and adjust their driving style accordingly, even when the weather appears to be perfect.