Building on the work completed for the Data Mining course, taken during the second semester of the 2023-2024 academic year in the Master of Data Science program at the Universitat Oberta de Catalunya (UOC), whose first and second parts are available on RPubs, I aim to improve the existing modeling and select the most suitable model for imputing the dependent variable “Es_mon_treball”. Finally, I will prepare the data by adding geocoordinates so that they can be used in geospatial analysis software. Since some of the tasks were already presented and discussed in the previous two parts, I will not repeat code sections that were detailed in those articles.
In this article, we standardize, clean, and code the dataset of drivers whose reason for travel was recorded as “Es desconeix” (“Unknown”) in the case files opened by Barcelona's local police, the Guàrdia Urbana. Because these steps repeat what was detailed in the first part, they are not described again here.
maskDistDescon <- dades$Nom_districte == "Desconegut"
dadesDistDescon <- dades[maskDistDescon, ]
dades_ <- dades[!maskDistDescon, ]
maskDistDescon <- dadesDesconegut$Nom_districte == "Desconegut"
dadesDistDescon <- dadesDesconegut[maskDistDescon, ]
dadesDesconegut_ <- dadesDesconegut[!maskDistDescon, ]
nMotiuConegut <- nrow(dades_)
nMotiuDesconegut <- nrow(dadesDesconegut_)
nConductors <- nMotiuConegut + nMotiuDesconegut
propMotiuConegut <- round((nMotiuConegut /nConductors) * 100, 3)
propMotiuDesconegut <- round((nMotiuDesconegut /nConductors) * 100, 3)
print(glue("A total of {nConductors} drivers are considered, of whom {nMotiuConegut} ({propMotiuConegut}%) have a known reason for their travel, while for {nMotiuDesconegut} ({propMotiuDesconegut}%), the reason for their travel is unknown."))
## A total of 5019 drivers are considered, of whom 2823 (56.246%) have a known reason for their travel, while for 2196 (43.754%), the reason for their travel is unknown.
It is relevant to note that the coding of the variable “Es_ocupacional”, which identifies whether the accident occurred during a period with a high proportion of work-related travel on weekdays, differs from the one performed in the second part: I now update it with the results of the 2023 Weekday Mobility Survey (Enquesta de Mobilitat en Dia Feiner). This survey provides information on time slots and reasons for travel (INSTITUT METRÒPOLI, 2023: 49). In contrast to the conclusions of the aforementioned study, we find the number of work-related trips between 5:00 and 16:00 to be significant.
dades_["Es_ocupacional"] <- "No"
horaOcupacional <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
maskHorari <- dades_$Hora_dia %in% horaOcupacional
maskLaborable <- dades_$Es_laborable == "Si"
dades_$Es_ocupacional[maskHorari & maskLaborable] <- "Si"
aggEs_ocupacional <- aggregate(dades_$Numero_expedient,
by=dades_["Es_ocupacional"], FUN=length)
ggplot(aggEs_ocupacional, aes(x="", y=x, fill=Es_ocupacional)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 1.1. Stacked Bar Chart of Driver Count Based on
Whether the Accident Occurred During Occupational Hours and
the Reason for Travel is Known.")) +
xlab(glue("Reason for Travel is Known")) +
ylab("Count")
nAccidents <- length(unique(dades_$Numero_expedient))
nNo_Es_ocupacional <- aggEs_ocupacional[1, 2]
propNo_Es_ocupacional <- round((nNo_Es_ocupacional / nMotiuConegut) * 100, 3)
nSi_Es_ocupacional <- aggEs_ocupacional[2, 2]
propSi_Es_ocupacional <- round((nSi_Es_ocupacional / nMotiuConegut) * 100, 3)
print(glue("Of the subset of drivers for whom the reason for their travel is known, {nSi_Es_ocupacional} ({propSi_Es_ocupacional}%) were traveling during occupational hours, while {nNo_Es_ocupacional} ({propNo_Es_ocupacional}%) were not, with a total of {nAccidents} accidents recorded."))
## Of the subset of drivers for whom the reason for their travel is known, 1272 (45.058%) were traveling during occupational hours, while 1551 (54.942%) were not, with a total of 2561 accidents recorded.
dadesDesconegut_["Es_ocupacional"] <- "No"
horaOcupacional <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
maskHorari <- dadesDesconegut_$Hora_dia %in% horaOcupacional
maskLaborable <- dadesDesconegut_$Es_laborable == "Si"
dadesDesconegut_$Es_ocupacional[maskHorari & maskLaborable] <- "Si"
aggEs_ocupacional <- aggregate(dadesDesconegut_$Numero_expedient,
by=dadesDesconegut_["Es_ocupacional"], FUN=length)
ggplot(aggEs_ocupacional, aes(x="", y=x, fill=Es_ocupacional)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 1.2. Stacked Bar Chart of Driver Count Based on
Whether the Accident Occurred During Occupational Hours and
the Reason for Travel is Unknown.")) +
xlab(glue("Reason for Travel is Unknown")) +
ylab("Count")
nAccidents <- length(unique(dadesDesconegut_$Numero_expedient))
nNo_Es_ocupacional <- aggEs_ocupacional[1, 2]
propNo_Es_ocupacional <- round((nNo_Es_ocupacional / nMotiuDesconegut) * 100, 3)
nSi_Es_ocupacional <- aggEs_ocupacional[2, 2]
propSi_Es_ocupacional <- round((nSi_Es_ocupacional / nMotiuDesconegut) * 100, 3)
print(glue("For the drivers with an unknown travel reason, {nSi_Es_ocupacional} ({propSi_Es_ocupacional}%) were traveling during occupational hours, while {nNo_Es_ocupacional} ({propNo_Es_ocupacional}%) were not, with a total of {nAccidents} accidents recorded."))
## For the drivers with an unknown travel reason, 1150 (52.368%) were traveling during occupational hours, while 1046 (47.632%) were not, with a total of 2029 accidents recorded.
Figures 1.1 and 1.2 show that the balance reverses depending on whether the reason for travel is known. Among drivers with a known reason, most were injured in accidents outside occupational hours (54.942% versus 45.058%), whereas among drivers with an unknown reason, most were injured during occupational hours (52.368% versus 47.632%).
We also check for relevant changes in class sizes in the victimization description variable. The territorial distribution was already verified in Figures 15.1 and 15.2 of the first part to be unaffected by data cleaning:
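Since the same coding is applied to both subsets, it could be wrapped in a function. The following is a minimal sketch, assuming the column names `Hora_dia` and `Es_laborable` used above; the helper `code_occupational()` and the toy data are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: flags rows whose hour falls inside the occupational
# window (5 to 16) on a working day. Column names follow the article.
code_occupational <- function(df, hours = 5:16) {
  in_window  <- df$Hora_dia %in% hours
  is_workday <- df$Es_laborable == "Si"
  df$Es_ocupacional <- "No"
  # which() drops NA comparisons, so rows with missing values stay "No"
  df$Es_ocupacional[which(in_window & is_workday)] <- "Si"
  df
}

# Toy data standing in for dades_ / dadesDesconegut_ (values are invented)
toy <- data.frame(
  Numero_expedient = c("A1", "A2", "A3", "A4"),
  Hora_dia         = c(8, 22, 10, 16),
  Es_laborable     = c("Si", "Si", "No", "Si"),
  stringsAsFactors = FALSE
)
toy <- code_occupational(toy)

# Share of each class, in percent
occupational_share <- round(100 * prop.table(table(toy$Es_ocupacional)), 1)
print(occupational_share)
```

Calling the helper once per subset would keep the two codings identical by construction and leave a single place to change the time window.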
dades_["Victimitzacio_est_"] <- ""
dades_$Victimitzacio_est_ <- sapply(strsplit(dades_$Descripcio_victimitzacio,
": "), "[", 2)
maskMorts <- grepl("Mort", dades_$Descripcio_victimitzacio)
dades_$Victimitzacio_est_[maskMorts] <- "Mort"
dades_$Victimitzacio_est_ <- str_to_sentence(dades_$Victimitzacio_est_)
aggVictmitzacio <- aggregate(dades_$Numero_expedient,
by=dades_["Victimitzacio_est_"],
FUN=length)
ggplot(aggVictmitzacio, aes(x="", y=x, fill=Victimitzacio_est_)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 2.1. Stacked Bar Chart of Driver Count by Type of Victimization
for Those Whose Reason for Travel Was Known.")) +
xlab(glue("Reason for Travel is Known")) +
ylab("Count")
dadesDesconegut_["Victimitzacio_est_"] <- ""
dadesDesconegut_$Victimitzacio_est_ <- sapply(strsplit(dadesDesconegut_$Descripcio_victimitzacio,
": "), "[", 2)
maskMorts <- grepl("Mort", dadesDesconegut_$Descripcio_victimitzacio)
dadesDesconegut_$Victimitzacio_est_[maskMorts] <- "Mort"
dadesDesconegut_$Victimitzacio_est_ <- str_to_sentence(dadesDesconegut_$Victimitzacio_est_)
aggVictmitzacio <- aggregate(dadesDesconegut_$Numero_expedient,
by=dadesDesconegut_["Victimitzacio_est_"],
FUN=length)
ggplot(aggVictmitzacio, aes(x="", y=x, fill=Victimitzacio_est_)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 2.2. Stacked Bar Chart of Driver Count by Type of Victimization
for Those Whose Reason for Travel Was Unknown.")) +
xlab(glue("Reason for Travel is Unknown")) +
ylab("Count")
Comparing Figures 2.1 and 2.2 shows that the proportions of the groups are practically identical for drivers whose reason for travel was known and for those whose reason was unknown.
We also code a new variable based on the month in which the accident occurred, classifying each record into the quarter corresponding to a season of the year: Winter (“Hivern”), Spring (“Primavera”), Summer (“Estiu”), and Autumn (“Tardor”).
dades_['Estacio_any'] <- "Estiu"
dades_$Estacio_any[dades_$Mes_any <= 3] <- "Hivern"
dades_$Estacio_any[dades_$Mes_any > 3 & dades_$Mes_any <= 6] <- "Primavera"
dades_$Estacio_any[dades_$Mes_any > 9] <- 'Tardor'
aggEstacions <- aggregate(dades_$Numero_expedient,
by=dades_["Estacio_any"],
FUN=length)
ggplot(aggEstacions, aes(x="", y=x, fill=Estacio_any)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 3.1. Stacked Bar Chart of Driver Count by Season of the
Accident for Those Whose Reason for Travel Was Known.")) +
xlab(glue("Reason for Travel is Known")) +
ylab("Count")
dades_$Tipus_vehicle_estandaritzat <- gsub("d'", "",
dades_$Tipus_vehicle_estandaritzat)
dades_$Victimitzacio_est_ <- gsub("d'", "",
dades_$Victimitzacio_est_)
dades_$Descripcio_causa_mediata <- gsub("d'", "",
dades_$Descripcio_causa_mediata)
dades_$Nom_mes <- gsub("ç", "s", dades_$Nom_mes)
dades_$Tipus_vehicle_estandaritzat <-
stri_trans_general(str = dades_$Tipus_vehicle_estandaritzat,
id = "Latin-ASCII")
dades_$Victimitzacio_est_ <-
stri_trans_general(str = dades_$Victimitzacio_est_,
id = "Latin-ASCII")
dades_$Descripcio_causa_mediata <-
stri_trans_general(str = dades_$Descripcio_causa_mediata,
id = "Latin-ASCII")
dades_$Nom_districte <-
stri_trans_general(str = dades_$Nom_districte,
id = "Latin-ASCII")
varsSup <- c("Descripcio_sexe", "Tipus_vehicle_estandaritzat",
"Victimitzacio_est_", "Nom_mes", "Nom_districte",
"Descripcio_causa_mediata", "Estacio_any",
"Victimes_CODIF", "Numero_morts",
"Numero_lesionats_lleus", "Numero_lesionats_greus",
"Numero_victimes", "Numero_vehicles_implicats",
"Vehicles_CODIF", "VM2R_CODIF", "VM4R_CODIF",
"V_no_permis_CODIF", "VUP_CODIF", "Interve_conductor_novell",
"Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
"Lleus_CODIF", "Greus_CODIF", "Morts_CODIF", "Es_atropellament",
"Es_mon_treball")
dadesSup <- dades_[varsSup]
dadesDesconegut_['Estacio_any'] <- "Estiu"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any <= 3] <- "Hivern"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any > 3 & dadesDesconegut_$Mes_any <= 6] <- "Primavera"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any > 9] <- 'Tardor'
aggEstacions <- aggregate(dadesDesconegut_$Numero_expedient,
by=dadesDesconegut_["Estacio_any"],
FUN=length)
ggplot(aggEstacions, aes(x="", y=x, fill=Estacio_any)) +
geom_bar(width = 1, stat = "identity", color = "black") +
ggtitle(glue("Fig. 3.2. Stacked Bar Chart of Driver Count by Season of the
Accident for Those Whose Reason for Travel Was Unknown.")) +
xlab(glue("Reason for Travel is Unknown")) +
ylab("Count")
dadesDesconegut_$Tipus_vehicle_estandaritzat <- gsub("d'", "",
dadesDesconegut_$Tipus_vehicle_estandaritzat)
dadesDesconegut_$Victimitzacio_est_ <- gsub("d'", "",
dadesDesconegut_$Victimitzacio_est_)
dadesDesconegut_$Descripcio_causa_mediata <- gsub("d'", "",
dadesDesconegut_$Descripcio_causa_mediata)
dadesDesconegut_$Nom_mes <- gsub("ç", "s", dadesDesconegut_$Nom_mes)
dadesDesconegut_$Tipus_vehicle_estandaritzat <-
stri_trans_general(str = dadesDesconegut_$Tipus_vehicle_estandaritzat,
id = "Latin-ASCII")
dadesDesconegut_$Victimitzacio_est_ <-
stri_trans_general(str = dadesDesconegut_$Victimitzacio_est_,
id = "Latin-ASCII")
dadesDesconegut_$Descripcio_causa_mediata <-
stri_trans_general(str = dadesDesconegut_$Descripcio_causa_mediata,
id = "Latin-ASCII")
dadesDesconegut_$Nom_districte <-
stri_trans_general(str = dadesDesconegut_$Nom_districte,
id = "Latin-ASCII")
varsSup <- c("Descripcio_sexe", "Tipus_vehicle_estandaritzat",
"Victimitzacio_est_", "Nom_mes", "Nom_districte",
"Descripcio_causa_mediata", "Estacio_any",
"Victimes_CODIF", "Numero_morts",
"Numero_lesionats_lleus", "Numero_lesionats_greus",
"Numero_victimes", "Numero_vehicles_implicats",
"Vehicles_CODIF", "VM2R_CODIF", "VM4R_CODIF",
"V_no_permis_CODIF", "VUP_CODIF", "Interve_conductor_novell",
"Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
"Lleus_CODIF", "Greus_CODIF", "Morts_CODIF", "Es_atropellament",
"Es_mon_treball")
dadesDesconegutSup <- dadesDesconegut_[varsSup]
Once again, Figures 3.1 and 3.2 show no substantial difference in the distribution of groups between the two subsets, and in both cases the highest number of injured drivers occurs in spring. However, among drivers whose reason for travel was unknown, the proportion for the winter season is slightly higher.
In the conclusions of the second part of this series, we noted several shortcomings in the modeling and its accuracy, possibly because of the very high number of variables involved. For this reason, in this section we first build a classification model with logistic regression to identify which of the 27 usable variables are statistically significant for predicting the dependent variable ‘Es_mon_treball’. Based on those results, we then construct several models with different algorithms, such as unpruned and pruned decision trees under various parameterizations, as well as random forest and XGBoost models.
From the principal component decomposition carried out in section 5 of the first part, we already know that the numerical variables, with the exception of ‘Numero_victimes’ and ‘Numero_lesionats_greus’, show little correlation with each other. In any case, we introduce them into the model in both their continuous and their coded forms. We then analyze the association between each variable and the dependent variable using logistic regression:
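The quarter-based season coding used for both subsets can also be written as a single mapping. This is a minimal sketch using base R's `cut()`; the helper name `month_to_season()` is hypothetical. Note one deliberate difference from the assignments above: a missing month stays `NA` here instead of falling back to "Estiu".

```r
# Hypothetical helper mapping month numbers (1-12) to the quarter-based
# seasons of the article: 1-3 Hivern, 4-6 Primavera, 7-9 Estiu, 10-12 Tardor.
month_to_season <- function(month) {
  as.character(cut(month,
                   breaks = c(0, 3, 6, 9, 12),   # intervals (0,3], (3,6], ...
                   labels = c("Hivern", "Primavera", "Estiu", "Tardor")))
}

seasons <- month_to_season(c(1, 3, 4, 6, 7, 9, 10, 12))
print(seasons)
```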
dadesSup["EMT_recodificat"] <- ifelse(dadesSup$Es_mon_treball=="Si", 1, 0)
cols_ <- colnames(dadesSup)
df_dtype_ <- as.data.frame(sapply(dadesSup, class))
for (col in cols_){
data_type <- paste(df_dtype_[col, 1])
n_domini <- length(unique(dadesSup[, col]))
if (data_type == "character"){
dadesSup[, col] <- as.factor(dadesSup[, col])
}
else if (n_domini <= 2) {
dadesSup[, col] <- as.factor(dadesSup[, col])
}
}
dadesSup$Numero_morts <- as.integer(dadesSup$Numero_morts)
cols_ <- colnames(dadesDesconegutSup)
df_dtype_ <- as.data.frame(sapply(dadesDesconegutSup, class))
for (col in cols_){
data_type <- paste(df_dtype_[col, 1])
n_domini <- length(unique(dadesDesconegutSup[, col]))
if (data_type == "character"){
dadesDesconegutSup[, col] <- as.factor(dadesDesconegutSup[, col])
}
else if (n_domini <= 2) {
dadesDesconegutSup[, col] <- as.factor(dadesDesconegutSup[, col])
}
}
dadesDesconegutSup$Numero_morts <- as.integer(dadesDesconegutSup$Numero_morts)
model <- glm(EMT_recodificat ~ Descripcio_sexe +
Edat_CODIF + Edat + Tipus_vehicle_estandaritzat +
Victimitzacio_est_ +
Descripcio_causa_mediata + Estacio_any +
Numero_morts + Morts_CODIF +
Numero_lesionats_lleus + Lleus_CODIF +
Numero_lesionats_greus + Greus_CODIF +
Numero_vehicles_implicats + Vehicles_CODIF +
Numero_victimes + Victimes_CODIF +
Nom_districte + VM2R_CODIF + VM4R_CODIF + V_no_permis_CODIF +
VUP_CODIF + Interve_conductor_novell + Es_atropellament +
Es_laborable + Es_ocupacional,
data=dadesSup, family=binomial(link=logit), na.action = NULL)
summary(model)
##
## Call:
## glm(formula = EMT_recodificat ~ Descripcio_sexe + Edat_CODIF +
## Edat + Tipus_vehicle_estandaritzat + Victimitzacio_est_ +
## Descripcio_causa_mediata + Estacio_any + Numero_morts + Morts_CODIF +
## Numero_lesionats_lleus + Lleus_CODIF + Numero_lesionats_greus +
## Greus_CODIF + Numero_vehicles_implicats + Vehicles_CODIF +
## Numero_victimes + Victimes_CODIF + Nom_districte + VM2R_CODIF +
## VM4R_CODIF + V_no_permis_CODIF + VUP_CODIF + Interve_conductor_novell +
## Es_atropellament + Es_laborable + Es_ocupacional, family = binomial(link = logit),
## data = dadesSup, na.action = NULL)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.141604 1.549508 -2.027 0.04261 *
## Descripcio_sexeHome 0.057154 0.092090 0.621 0.53484
## Edat_CODIFConductors entre 30 i 37 anys edat -0.099422 0.152061 -0.654 0.51322
## Edat_CODIFConductors fins 29 anys edat -0.595047 0.210195 -2.831 0.00464 **
## Edat_CODIFConductors majors de 49 anys edat 0.201110 0.177586 1.132 0.25744
## Edat -0.018628 0.009312 -2.000 0.04546 *
## Tipus_vehicle_estandaritzatVehicles motoritzats de quatre rodes 0.167033 0.124465 1.342 0.17959
## Tipus_vehicle_estandaritzatVehicles sense permis de conduccio -0.190535 0.135211 -1.409 0.15878
## Tipus_vehicle_estandaritzatVehicles Us Professional 4.530266 0.736989 6.147 7.90e-10 ***
## Victimitzacio_est_Hospitalitzacio fins a 24h 0.159140 0.101444 1.569 0.11671
## Victimitzacio_est_Hospitalitzacio superior a 24h 0.611889 0.743511 0.823 0.41052
## Victimitzacio_est_Mort -0.896369 1.733621 -0.517 0.60512
## Victimitzacio_est_Rebutja assistencia sanitaria 0.284140 0.169696 1.674 0.09405 .
## Descripcio_causa_mediataCanvi de carril sense precaucio -0.227889 0.221669 -1.028 0.30392
## Descripcio_causa_mediataDesobeir altres senyals -0.402497 0.246703 -1.632 0.10278
## Descripcio_causa_mediataDesobeir semafor -0.126338 0.227335 -0.556 0.57839
## Descripcio_causa_mediataEnvair calcada contraria -1.181084 0.978785 -1.207 0.22755
## Descripcio_causa_mediataFallada mecanica o avaria -10.738271 196.968101 -0.055 0.95652
## Descripcio_causa_mediataGir indegut o sense precaucio -0.339264 0.220003 -1.542 0.12305
## Descripcio_causa_mediataManca atencio a la conduccio -0.277985 0.206952 -1.343 0.17919
## Descripcio_causa_mediataManca precaucio efectuar marxa enrera -0.285622 0.436302 -0.655 0.51270
## Descripcio_causa_mediataManca precaucio incorporacio circulacio -0.220823 0.274093 -0.806 0.42044
## Descripcio_causa_mediataNo cedir la dreta -0.181772 0.371908 -0.489 0.62501
## Descripcio_causa_mediataNo respectar distancies -0.497367 0.213964 -2.325 0.02010 *
## Descripcio_causa_mediataNo respectat pas de vianants -0.022101 0.607251 -0.036 0.97097
## Estacio_anyHivern 0.007894 0.121827 0.065 0.94834
## Estacio_anyPrimavera 0.233228 0.115624 2.017 0.04368 *
## Estacio_anyTardor 0.038313 0.119853 0.320 0.74922
## Numero_morts 1.657122 1.320503 1.255 0.20951
## Morts_CODIFSi NA NA NA NA
## Numero_lesionats_lleus -0.020486 0.073968 -0.277 0.78182
## Lleus_CODIFSi 0.552134 0.601838 0.917 0.35893
## Numero_lesionats_greus 1.202418 0.578543 2.078 0.03768 *
## Greus_CODIFSi -1.876713 0.780793 -2.404 0.01623 *
## Numero_vehicles_implicats -0.033703 0.090138 -0.374 0.70847
## Vehicles_CODIFSi 0.799640 0.283757 2.818 0.00483 **
## Numero_victimes NA NA NA NA
## Victimes_CODIFSi -0.383877 0.140993 -2.723 0.00648 **
## Nom_districteEixample 0.070432 0.220617 0.319 0.74953
## Nom_districteGracia 0.020947 0.284376 0.074 0.94128
## Nom_districteHorta-Guinardo -0.344914 0.262054 -1.316 0.18811
## Nom_districteLes Corts 0.450337 0.257007 1.752 0.07973 .
## Nom_districteNou Barris -0.086124 0.278184 -0.310 0.75687
## Nom_districteSant Andreu -0.328615 0.261826 -1.255 0.20945
## Nom_districteSant Marti 0.113460 0.234503 0.484 0.62850
## Nom_districteSants-Montjuic 0.345712 0.242471 1.426 0.15393
## Nom_districteSarria-Sant Gervasi -0.011230 0.241102 -0.047 0.96285
## VM2R_CODIFSi 0.103620 0.128622 0.806 0.42046
## VM4R_CODIFSi 0.128787 0.108049 1.192 0.23329
## V_no_permis_CODIFSi 0.058489 0.102270 0.572 0.56739
## VUP_CODIFSi 0.347619 0.227924 1.525 0.12722
## Interve_conductor_novellSi -0.031369 0.096416 -0.325 0.74491
## Es_atropellamentSi 0.471575 0.415103 1.136 0.25594
## Es_laborableSi 0.621579 0.118900 5.228 1.72e-07 ***
## Es_ocupacionalSi 0.872729 0.095891 9.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3900.6 on 2822 degrees of freedom
## Residual deviance: 3432.0 on 2770 degrees of freedom
## AIC: 3538
##
## Number of Fisher Scoring iterations: 10
If we set the threshold for statistical significance at a p-value (Pr(>|z|)) below 0.05, the results indicate that we should retain only 11 of the variables considered in the logistic regression. These comprise 2 continuous numerical variables and 9 categorical variables, detailed in the following table:
| Variable Name | Data Type | Variable Description |
|---|---|---|
| “Tipus_vehicle_estandaritzat” | Categorical | Standardization into just four categories for the ‘Tipus_vehicle’ variable: ‘Professional use vehicles’ (“Vehicles d’ús professional”), ‘Four-wheeled motor vehicles’ (“Vehicles motoritzats de quatre rodes”), ‘Two-wheeled motor vehicles’ (“Vehicles motoritzats de dues rodes”), and ‘Vehicles without a driving license’ (“Vehicles sense permís de conducció”). |
| “Descripcio_causa_mediata” | Categorical | Details the immediate cause identified by the corresponding Guàrdia Urbana patrol that filed the accident report, referring to the type of maneuver or immediate circumstance that caused the accident. |
| “Numero_lesionats_greus” | Continuous numerical | Number of people injured in the accident who required hospitalization for more than 24 hours. |
| “Greus_CODIF” | Categorical | Determines whether there was one or more seriously injured individuals in the accident. |
| “Nom_mes” | Categorical | Month of the year in which the accident occurred. |
| “Victimes_CODIF” | Categorical | Determines whether there were two or more victims in the accident. |
| “Vehicles_CODIF” | Categorical | Determines whether there were two or more vehicles involved in the accident. |
| “Edat” | Continuous numerical | Complete years of age of the driver on the date of the accident. |
| “Edat_CODIF” | Categorical | Age group coding based on interquartile ranges: under 29 years, from 29 to 37 years, from 37 to 49 years, and over 49 years. |
| “Es_laborable” | Categorical | Determines if the date of the accident was a working day during the year 2023. |
| “Es_ocupacional” | Categorical | Determines if the time of the accident was during occupational hours, from 5 AM to 4 PM, on a working day. |
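The 0.05 selection rule can also be applied programmatically to the coefficient matrix returned by `summary()`. The sketch below uses simulated data rather than the accident dataset, so the variable names `x1` and `x2` and all values are assumptions made for illustration only.

```r
# Simulated data: x1 has a real effect on the outcome, x2 is pure noise
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

# Coefficient matrix with columns Estimate, Std. Error, z value, Pr(>|z|)
coefs    <- coef(summary(fit))
p_values <- coefs[, "Pr(>|z|)"]

# Keep the terms below the 0.05 threshold, excluding the intercept
sig_terms <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(sig_terms)
```

For a factor, each level other than the reference appears as a separate term, so a single significant level still requires a judgment on whether to keep the whole variable.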
We also check the accuracy of this model by testing it on the same data on which it was trained, measuring its sensitivity (the degree to which it correctly identifies that the travel was work-related) and its specificity (the degree to which it correctly identifies that the travel was not work-related):
pred <- predict(model, newdata=dadesSup, type='response')
dadesSup$prediction <- ifelse(pred < 0.5 ,0, 1)
prediccio <- as.factor(dadesSup$prediction)
valor_esperat <- dadesSup$EMT_recodificat
tb <- table(valor_esperat, prediccio); tb
## prediccio
## valor_esperat 0 1
## 0 1061 446
## 1 508 808
# table() stores counts column by column (rows = observed, columns = predicted):
# TN: tb[1] (observed 0, predicted 0)
# FN: tb[2] (observed 1, predicted 0)
# FP: tb[3] (observed 0, predicted 1)
# TP: tb[4] (observed 1, predicted 1)
# accuracy: (TP + TN) / total
precisio <- (tb[4] + tb[1]) / (tb[1] + tb[2] + tb[3] + tb[4])
print(glue("\n\nThe accuracy of the prediction is {round(precisio, 4)}."))
##
## The accuracy of the prediction is 0.6621.
# sensitivity: TP / (TP + FN)
sensitivitat <- tb[4] / (tb[4] + tb[2])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))
##
## The sensitivity of the prediction is 0.614.
# specificity: TN / (TN + FP)
especificitat <- tb[1] / (tb[1] + tb[3])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))
##
## The specificity of the prediction is 0.704.
The model shows moderate predictive capacity, with a specificity higher than its sensitivity. Because these figures were obtained on the same data used to fit the model, they are likely optimistic and overfitting cannot be ruled out. For this reason, and because the purpose of this logistic regression was only to select, based on statistical significance, the variables with potential predictive capacity, we will discard it as a predictive model.
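Positional indexing of a confusion table is easy to misread, because `table()` stores its cells column by column. A minimal sketch that indexes the cells by name instead is shown below; the helper `classification_metrics()` and the toy vectors are hypothetical, and both classes are assumed to be coded as 0 and 1.

```r
# Hypothetical helper: observed and predicted are 0/1 vectors of equal length
classification_metrics <- function(observed, predicted) {
  tb <- table(observed  = factor(observed,  levels = c(0, 1)),
              predicted = factor(predicted, levels = c(0, 1)))
  tn <- tb["0", "0"]; fp <- tb["0", "1"]   # observed negatives
  fn <- tb["1", "0"]; tp <- tb["1", "1"]   # observed positives
  c(accuracy    = (tp + tn) / sum(tb),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Invented labels: 4 observed positives and 6 observed negatives
observed  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
metrics <- classification_metrics(observed, predicted)
print(round(metrics, 4))
```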
As for the variables whose coefficients are reported as NA, the model output marks them as “not defined because of singularities”, which indicates perfect collinearity with other variables or classes; we will therefore treat them as non-explanatory as well.
On the other hand, two pairs of variables derive from each other: ‘Numero_lesionats_greus’ and ‘Greus_CODIF’, and ‘Edat’ and ‘Edat_CODIF’. All four will be considered in the following modeling with the XGBoost algorithm, but for decision trees only the two categorical variables, ‘Greus_CODIF’ and ‘Edat_CODIF’, can be considered.
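How an NA coefficient arises can be reproduced on simulated data. In this sketch the variables `minor`, `serious`, and `total` are invented, with `total` defined as the exact sum of the other two; this is an assumed illustration of the mechanism, not a statement about how the accident variables were built.

```r
# Simulated counts; total is an exact linear combination of the other two
set.seed(2)
n       <- 200
minor   <- rpois(n, 1)
serious <- rpois(n, 0.3)
total   <- minor + serious
y       <- rbinom(n, 1, plogis(-0.3 + 0.4 * minor))

fit <- glm(y ~ minor + serious + total, family = binomial)

# Terms that glm() could not estimate because they are redundant
aliased_terms <- names(which(is.na(coef(fit))))
print(aliased_terms)
```

The redundant term carries no information beyond the predictors already in the model, which is why dropping it does not change the fit.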
The XGBoost algorithm is implemented following the example described by P. Bruce, A. Bruce, and P. Gedeck for boosting algorithms (BRUCE; BRUCE & GEDECK, 2022: 263-265). As proposed in the conclusions of the second part, we apply a workflow that runs 500 different models with various random parameterizations, automates the comparison of results to select the best-performing model, and repeats these checks on various portions of the training dataset, following the detailed example by T. Pham on RPubs. Subsequently, we classify the dataset of drivers involved in accidents whose reason for travel is unknown.
We build the model with the 11 variables that, according to the analysis in the previous section, help explain the dependent variable ‘Es_mon_treball’; we then check which subset of related variables yields better classification results on the testing subsets created during the workflow.
First, we will consider all 11 variables:
data.xgb <- dadesSup[, c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Numero_lesionats_greus",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
"Es_mon_treball")]
set.seed(123)
cust_split <- data.xgb %>%
initial_split(prop = 0.8, strata = Es_mon_treball)
train <- training(cust_split)
test <- testing(cust_split)
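The idea behind the stratified 80/20 split is to sample the same proportion within each class of the dependent variable, so that both partitions keep the original class balance. The following base R sketch illustrates that idea only; the helper `stratified_train_index()` and the toy data are hypothetical, and this is not the implementation used by `initial_split()`.

```r
# Hypothetical helper: returns training row indices, sampling `prop` of the
# rows within each class of `strata`
stratified_train_index <- function(strata, prop = 0.8) {
  idx_by_class <- split(seq_along(strata), strata)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Invented data: 40 "Si" and 60 "No" rows
set.seed(123)
toy <- data.frame(Es_mon_treball = rep(c("Si", "No"), times = c(40, 60)),
                  x = rnorm(100))

train_idx <- stratified_train_index(toy$Es_mon_treball, prop = 0.8)
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

print(table(train_toy$Es_mon_treball))
print(table(test_toy$Es_mon_treball))
```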
# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Es_mon_treball)
# Note: cust_rec is defined here but is not added to the workflow below, which uses add_formula()
cust_rec <- recipe(Es_mon_treball ~., data = train) %>%
# update_role(customerID, new_role = "ID") %>%
# step_corr(all_numeric()) %>%
step_corr(all_numeric(), threshold = 0.7, method = "spearman") %>%
step_zv(all_numeric()) %>% # filter zero variance
step_normalize(all_numeric()) %>%
step_dummy(all_nominal_predictors())
# Setup a model specification
xgb_spec <-boost_tree(
trees = 500,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune() ## step size
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_wf <- workflow() %>%
add_formula(Es_mon_treball ~.) %>%
add_model(xgb_spec)
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train),
learn_rate(),
size = 20
)
doParallel::registerDoParallel()
set.seed(234)
xgb_res <-tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
xgb_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
names_to = "parameter",
values_to = "value") %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE)+
facet_wrap(~parameter, scales = "free_x")
show_best(xgb_res)
## # A tibble: 5 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 9 4 7 0.00000000628 0.0985 0.690 roc_auc
## 2 7 3 5 0.0000366 0.000384 0.386 roc_auc
## 3 8 14 2 0.0500 0.0000364 0.492 roc_auc
## 4 10 7 4 0.0224 2.45 0.136 roc_auc
## 5 11 18 6 0.000269 0.00000428 0.902 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs1 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs1 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.671 Preprocessor1_Model1
## 2 sens binary 0.725 Preprocessor1_Model1
## 3 spec binary 0.610 Preprocessor1_Model1
## 4 roc_auc binary 0.708 Preprocessor1_Model1
Secondly, we will build the model with 9 variables, excluding ‘Greus_CODIF’ and ‘Edat_CODIF’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs2 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs2 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.673 Preprocessor1_Model1
## 2 sens binary 0.699 Preprocessor1_Model1
## 3 spec binary 0.644 Preprocessor1_Model1
## 4 roc_auc binary 0.710 Preprocessor1_Model1
Thirdly, we will build the model with 10 variables, excluding the variable ‘Numero_lesionats_greus’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs3 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs3 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.673 Preprocessor1_Model1
## 2 sens binary 0.699 Preprocessor1_Model1
## 3 spec binary 0.644 Preprocessor1_Model1
## 4 roc_auc binary 0.710 Preprocessor1_Model1
Fourthly, we will build the model with 9 variables, excluding the variables ‘Numero_lesionats_greus’ and ‘Edat’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs4 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs4 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.666 Preprocessor1_Model1
## 2 sens binary 0.679 Preprocessor1_Model1
## 3 spec binary 0.652 Preprocessor1_Model1
## 4 roc_auc binary 0.702 Preprocessor1_Model1
Fifthly, we will build the model with 10 variables, excluding the variable ‘Greus_CODIF’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs5 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs5 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.668 Preprocessor1_Model1
## 2 sens binary 0.712 Preprocessor1_Model1
## 3 spec binary 0.617 Preprocessor1_Model1
## 4 roc_auc binary 0.705 Preprocessor1_Model1
Sixthly, we will build the model with 10 variables, excluding the variable ‘Edat’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs6 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs6 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.666 Preprocessor1_Model1
## 2 sens binary 0.679 Preprocessor1_Model1
## 3 spec binary 0.652 Preprocessor1_Model1
## 4 roc_auc binary 0.699 Preprocessor1_Model1
And finally, we will build the model with 10 variables, excluding the variable ‘Edat_CODIF’:
best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_rs7 <- last_fit(final_xgb, cust_split,
metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs7 %>%
collect_metrics()
## # A tibble: 4 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.670 Preprocessor1_Model1
## 2 sens binary 0.699 Preprocessor1_Model1
## 3 spec binary 0.636 Preprocessor1_Model1
## 4 roc_auc binary 0.712 Preprocessor1_Model1
Next, we summarize the results in the following table, which includes accuracy, sensitivity, and specificity as well as the Area Under the Receiver Operating Characteristic curve (AUROC). The AUROC can be read as the probability that the model assigns a higher score to a randomly chosen positive case than to a randomly chosen negative one:
| Model | Excluded Variables | Accuracy | Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|
| Model #1 | None | 0.6714 | 0.7252 | 0.6098 | 0.7077 |
| Model #2 | “Greus_CODIF”, “Edat_CODIF” | 0.6731 | 0.6987 | 0.6439 | 0.7099 |
| Model #3 | “Numero_lesionats_greus” | 0.6731 | 0.6987 | 0.6439 | 0.7104 |
| Model #4 | “Numero_lesionats_greus”, “Edat” | 0.6661 | 0.6788 | 0.6515 | 0.7016 |
| Model #5 | “Greus_CODIF” | 0.6678 | 0.7119 | 0.6174 | 0.7049 |
| Model #6 | “Edat” | 0.6661 | 0.6788 | 0.6515 | 0.6994 |
| Model #7 | “Edat_CODIF” | 0.6696 | 0.6987 | 0.6364 | 0.7120 |
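A minimal sketch of how an AUROC value can be obtained from predicted scores: it is the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The scores and labels are toy values, not taken from the models above, and the helper auroc is hypothetical.

```r
# Hypothetical helper: AUROC as the share of correctly ordered positive/negative pairs
auroc <- function(scores, labels, positive = "Si") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  # 1 when the positive case scores higher, 0.5 for ties, 0 otherwise
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Toy example: two positive and three negative cases
scores <- c(0.9, 0.8, 0.4, 0.35, 0.2)
labels <- c("Si", "No", "Si", "No", "No")
auc_toy <- auroc(scores, labels)  # 5 of the 6 pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1 to perfect separation, so values around 0.70 indicate moderate discrimination.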
In total, seven classification models have been built, differing in whether they use the two numerical variables ‘Edat’ and ‘Numero_lesionats_greus’ and their corresponding coded variables ‘Edat_CODIF’ and ‘Greus_CODIF’. All seven models show similar accuracy and AUROC scores; the larger differences appear in sensitivity and specificity. The exception is Models #2 and #3, which share identical accuracy, sensitivity, and specificity and differ only slightly in AUROC. In every model, sensitivity is higher than specificity. Model #3 excludes only one of the numerical variables, whereas Model #2 excludes the two coded categorical variables.
Finally, we choose Model #2 because it combines a high proportion of correct positive and negative predictions and because it is easier to explain: it does not use the same variable twice, in both its numerical and categorical form.
final_wf <- final_rs2 %>%
extract_workflow()
desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Numero_lesionats_greus",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat", "Es_laborable", "Es_ocupacional")]
prediction.xgb <- predict(final_wf, desconegutsPred)
table(prediction.xgb)
## .pred_class
## No Si
## 1113 1083
In the classification results of the dataset of drivers whose reason for travel is unknown, we observe that both classes have been assigned almost equally.
Next, we will build a classification model using the Decision Tree algorithm from the C50 package in R, without any pruning or changes to its default parameters:
First, we will check it by also including the variable ‘Nom_mes’:
dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Es_laborable", "Es_ocupacional")
y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides
x <- dummy[colsSup]
split_prop <- 5
indexes = sample(1:nrow(dummy),
size=floor(((split_prop-1)/split_prop)*nrow(dummy)))
train_x <- x[indexes, ]
train_y <- y[indexes]
test_x <- x[-indexes, ]
test_y <- y[-indexes]
model1 <- C50::C5.0(train_x, train_y, rules=TRUE)
summary(model1)
##
## Call:
## C5.0.default(x = train_x, y = train_y, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Sep 5 17:19:23 2024
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 2258 cases (10 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (1217/386, lift 1.3)
## Tipus_vehicle_estandaritzat in {Vehicles motoritzats de 2 rodes,
## Vehicles motoritzats de quatre rodes,
## Vehicles sense permis de conduccio}
## Es_ocupacional = No
## -> class No [0.683]
##
## Rule 2: (61/1, lift 2.1)
## Tipus_vehicle_estandaritzat = Vehicles Us Professional
## -> class Si [0.968]
##
## Rule 3: (999/391, lift 1.3)
## Es_ocupacional = Si
## -> class Si [0.608]
##
## Default class: No
##
##
## Evaluation on training data (2258 cases):
##
## Rules
## ----------------
## No Errors
##
## 3 778(34.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 831 392 (a): class No
## 386 649 (b): class Si
##
##
## Attribute usage:
##
## 98.14% Es_ocupacional
## 56.60% Tipus_vehicle_estandaritzat
##
##
## Time: 0.0 secs
model1 <- C50::C5.0(train_x, train_y)
plot(model1, type="s", title="Fig. 4. Unpruned decision tree.")
predicted_model1 <- predict(model1, test_x, type="class",
threshold=0.7)
print(sprintf("The accuracy of the unpruned tree is %.4f %%.",
100*sum(predicted_model1 == test_y) / length(predicted_model1)))
## [1] "The accuracy of the unpruned tree is 65.1327 %."
mat_conf <- table(test_y, Predicted=predicted_model1); mat_conf
## Predicted
## test_y No Si
## No 183 101
## Si 96 185
# Linear indices run column by column (rows = actual, columns = predicted):
# TN: mat_conf[1]
# FN: mat_conf[2]
# FP: mat_conf[3]
# TP: mat_conf[4]
# sensitivity
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))
##
## The sensitivity of the prediction is 0.6469.
# specificity
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))
##
## The specificity of the prediction is 0.6559.
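Since a table is stored column by column, linear indices such as mat_conf[2] and mat_conf[3] are easy to swap. The sketch below rebuilds the matrix from the counts printed above, indexes the cells by name, and computes the four usual ratios side by side; the variable names are illustrative. Under the standard definitions, the row-based ratios are sensitivity (recall) and specificity, while the column-based ratios, which reproduce the 0.6469 and 0.6559 reported above, are the positive and negative predictive values.

```r
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class)
mat_conf <- as.table(matrix(c(183, 96, 101, 185), nrow = 2,
                            dimnames = list(test_y = c("No", "Si"),
                                            Predicted = c("No", "Si"))))

TN <- mat_conf["No", "No"]; FP <- mat_conf["No", "Si"]
FN <- mat_conf["Si", "No"]; TP <- mat_conf["Si", "Si"]

recall      <- TP / (TP + FN)  # share of actual "Si" predicted as "Si"
specificity <- TN / (TN + FP)  # share of actual "No" predicted as "No"
ppv         <- TP / (TP + FP)  # share of predicted "Si" that are actually "Si"
npv         <- TN / (TN + FN)  # share of predicted "No" that are actually "No"

metrics <- round(c(recall = recall, specificity = specificity,
                   ppv = ppv, npv = npv), 4)
```

Indexing by name keeps the computation correct even if the order of the factor levels or the orientation of the table changes.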
In this case, the unpruned decision tree is notably easier to interpret than the decision tree models built in the second part: the model primarily uses the variables ‘Tipus_vehicle_estandaritzat’ and ‘Es_ocupacional’ for classification. However, we also note that sensitivity remains lower than specificity.
It should also be remembered that the decision tree implemented here lacks robustness. As we observed in the second part, this dataset contains very small class groups which, when the training dataset is created by random selection, can be especially underrepresented or even omitted entirely. As a result, the results obtained in one run can differ from those of the next, and their reliability becomes questionable. Finally, we note that, because the algorithm prioritizes the variables with the greatest explanatory capacity, such as vehicle type and whether the accident occurred during occupational hours, the other seven categorical variables have virtually no effect.
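To illustrate how a very small class group can be underrepresented or lost in a random split, the following sketch simulates repeated 80% training samples from a toy factor. The counts (2820 common cases and 3 rare ones) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
set.seed(42)

# Toy factor: 2820 common cases and 3 rare ones (hypothetical counts)
x <- factor(c(rep("common", 2820), rep("rare", 3)))
n_train <- floor(0.8 * length(x))

# For 1000 random 80% training samples, count how many rare cases were drawn
rare_in_train <- replicate(1000, sum(x[sample(length(x), n_train)] == "rare"))

# Share of splits in which the rare level is absent, or appears at most once
prop_absent      <- mean(rare_in_train == 0)
prop_at_most_one <- mean(rare_in_train <= 1)
```

Fixing the seed before each split, or stratifying the split on the affected variable, are common ways to make such runs reproducible and less sensitive to this effect.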
desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Es_laborable", "Es_ocupacional")]
predictions <- predict(model1, desconegutsPred, type="class", threshold=0.7)
table(predictions)
## predictions
## No Si
## 1043 1153
On the other hand, it is noteworthy that the proportion of negative and positive classifications is virtually identical to the proportion of positive and negative classes of the ‘Es_ocupacional’ variable for drivers whose reason for travel is unknown (see Fig. 1.2 above). This suggests that the model is practically using only the ‘Es_ocupacional’ variable to make this classification, possibly without successfully classifying drivers who traveled on non-working days or during non-occupational hours.
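The similarity between the two distributions can be checked directly from the counts reported in this article: the predicted classes printed above (1043 and 1153) and the counts of non-occupational and occupational hours among drivers whose reason for travel is unknown (1046 and 1150, as listed in the results table). The object names are illustrative.

```r
# Predicted classes for drivers with an unknown reason for travel (printed above)
pred_counts <- c(No = 1043, Si = 1153)
# 'Es_ocupacional' counts for the same drivers (from the results table)
ocup_counts <- c(No = 1046, Si = 1150)

pred_pct <- round(100 * prop.table(pred_counts), 3)
ocup_pct <- round(100 * prop.table(ocup_counts), 3)

# Largest difference between the two distributions, in percentage points
max_diff <- max(abs(pred_pct - ocup_pct))
```

A difference of this size corresponds to only three drivers, which is consistent with a model that relies almost entirely on ‘Es_ocupacional’.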
We will fit the same decision tree again, this time allowing up to 99 boosting iterations (the trials argument of C5.0), so that the final prediction combines the trees built in each iteration:
nTrials <-99
model_nTrials <- C50::C5.0(train_x, train_y, trials = nTrials)
predicted_modelnTrials <- predict(model_nTrials, test_x, type="class")
print(sprintf("The accuracy of the tree with 99 iterations is %.4f %%.",
100*sum(predicted_modelnTrials == test_y) / length(predicted_modelnTrials)))
## [1] "The accuracy of the tree with 99 iterations is 64.6018 %."
mat_conf <- table(test_y, Predicted=predicted_modelnTrials); mat_conf
## Predicted
## test_y No Si
## No 199 85
## Si 115 166
# Linear indices run column by column (rows = actual, columns = predicted):
# TN: mat_conf[1]
# FN: mat_conf[2]
# FP: mat_conf[3]
# TP: mat_conf[4]
# sensitivity
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))
##
## The sensitivity of the prediction is 0.6614.
# specificity
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))
##
## The specificity of the prediction is 0.6338.
desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Es_laborable", "Es_ocupacional")]
predictions <- predict(model_nTrials, desconegutsPred, type="class", threshold=0.7)
table(predictions)
## predictions
## No Si
## 1190 1006
In this model, an improvement in sensitivity is observed, and this indicator is now above specificity. Moreover, when classifying the drivers whose reason for travel is unknown, a slight majority of the results (1190 of 2196) are negative.
Subsequently, we will also build a classification model using the Random Forest algorithm, which by default grows 500 trees:
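The value 0.7 passed as threshold in the calls above is the acceptance threshold for the positive class. As a generic sketch of how such a threshold can be applied explicitly to predicted probabilities, independently of whether a given predict() method honors a threshold argument, consider the following; the probabilities are toy values and the helper classify_at is hypothetical.

```r
# Hypothetical helper: assign "Si" only when the predicted probability of the
# positive class reaches the acceptance threshold
classify_at <- function(prob_si, threshold = 0.7) {
  factor(ifelse(prob_si >= threshold, "Si", "No"), levels = c("No", "Si"))
}

# Toy probabilities of the positive class for five drivers
prob_si <- c(0.95, 0.72, 0.69, 0.50, 0.10)
pred <- classify_at(prob_si)
table(pred)
```

With C5.0, class probabilities can be requested with predict(model, newdata, type = "prob"), and the "Si" column could then be passed to a helper like this one.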
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Es_laborable", "Es_ocupacional")
y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides
x <- dummy[colsSup]
y <- dummy$Es_mon_treball
rf <- randomForest(y ~., data = x, threshold=0.7); rf
##
## Call:
## randomForest(formula = y ~ ., data = x, threshold = 0.7)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 37.27%
## Confusion matrix:
## No Si class.error
## No 962 545 0.3616457
## Si 507 809 0.3852584
mat_conf <- rf$confusion
# sensitivity
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))
##
## The sensitivity of the prediction is 0.5975.
# specificity
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))
##
## The specificity of the prediction is 0.6549.
predictions.rf <- predict( rf, desconegutsPred, type="class")
table(predictions.rf)
## predictions.rf
## No Si
## 1064 1132
The performance of this model is very similar to that of the previous one with 99 boosting iterations, although the resulting classification of the drivers whose reason for travel is unknown is clearly different.
Finally, we will also try the pruned tree method implemented in the second part. In this case, we build the model with the tree() function from the tree package and prune it with prune.tree() so that it keeps the three terminal nodes (best = 3) that yield the best result.
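The out-of-bag (OOB) error and the per-class errors reported by the Random Forest output can be reproduced from the confusion matrix printed above. The sketch below uses those counts, with rows as actual classes and columns as predicted classes; the object names are illustrative.

```r
# Counts copied from the Random Forest confusion matrix printed above
conf_rf <- matrix(c(962, 507, 545, 809), nrow = 2,
                  dimnames = list(actual = c("No", "Si"),
                                  predicted = c("No", "Si")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf_rf) / rowSums(conf_rf)

# Overall OOB error: share of all cases that were misclassified
oob_error <- 1 - sum(diag(conf_rf)) / sum(conf_rf)
oob_pct <- round(100 * oob_error, 2)
```

The result matches the 37.27% OOB estimate in the output, which corresponds to an accuracy of roughly 62.7%.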
dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
"Descripcio_causa_mediata",
"Nom_mes",
"Greus_CODIF",
"Victimes_CODIF",
"Vehicles_CODIF",
"Edat_CODIF", "Es_laborable", "Es_ocupacional",
"Es_mon_treball")
y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides
dummy_ <- dummy[colsSup]
split <- createDataPartition(y=dummy_$Es_mon_treball, p=4/5, list=FALSE)
train <- dummy_[split,]
test <- dummy_[-split,]
trees <- tree(Es_mon_treball~., train)
prune.trees <- prune.tree(trees, best=3)
tree.pred <- predict(prune.trees, test, type='class', threshold=0.7)
confusionMatrix(tree.pred, test$Es_mon_treball, positive = "Si")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Si
## No 197 100
## Si 104 163
##
## Accuracy : 0.6383
## 95% CI : (0.5971, 0.678)
## No Information Rate : 0.5337
## P-Value [Acc > NIR] : 3.21e-07
##
## Kappa : 0.274
##
## Mcnemar's Test P-Value : 0.8336
##
## Sensitivity : 0.6198
## Specificity : 0.6545
## Pos Pred Value : 0.6105
## Neg Pred Value : 0.6633
## Prevalence : 0.4663
## Detection Rate : 0.2890
## Detection Prevalence : 0.4734
## Balanced Accuracy : 0.6371
##
## 'Positive' Class : Si
##
prediction.prune.tree <- predict(prune.trees, desconegutsPred, type='class',
threshold=0.7)
table(prediction.prune.tree)
## prediction.prune.tree
## No Si
## 1043 1153
In the pruned tree model, the results are similar to those of the unpruned decision tree, and the classification of the drivers whose reason for travel is unknown is identical (1043 negative and 1153 positive).
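Two of the statistics in the confusionMatrix() output above, Cohen's kappa and balanced accuracy, can be derived by hand from the four counts. The sketch below rebuilds the matrix as printed (rows = prediction, columns = reference) and is only a worked check of those figures; the object names are illustrative.

```r
# Counts copied from the confusion matrix printed above
conf_tree <- matrix(c(197, 104, 100, 163), nrow = 2,
                    dimnames = list(Prediction = c("No", "Si"),
                                    Reference = c("No", "Si")))
n <- sum(conf_tree)

# Observed agreement (accuracy) and agreement expected by chance
p_obs <- sum(diag(conf_tree)) / n
p_exp <- sum(rowSums(conf_tree) * colSums(conf_tree)) / n^2
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)

# Sensitivity and specificity with "Si" as the positive class
sens <- conf_tree["Si", "Si"] / sum(conf_tree[, "Si"])
spec <- conf_tree["No", "No"] / sum(conf_tree[, "No"])
balanced_accuracy <- (sens + spec) / 2
```

A kappa of about 0.27 indicates that the agreement between predictions and reference is only modestly better than what would be expected by chance.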
First, we will present the Table with the most relevant results of this third part and then justify the choice of the XGBoost model result.
Only the accuracy indicators of the logistic regression, XGBoost, and pruned tree models will be detailed since in all three cases the resulting model is robust enough, that is, it does not show appreciable variability in its results.
| Index | Results Description | Data Type | Results |
|---|---|---|---|
| #1 | Total number of drivers | Integer | 5019 |
| #2 | Subtotal of drivers whose reason for travel is known, proportion of the total | Integer, percent | 2823, 56.246% |
| #3 | Number of accidents involving drivers whose reason for travel is known | Integer | 2561 |
| #4 | Subtotal of drivers whose reason for travel is unknown, proportion of the total | Integer, percent | 2196, 43.754% |
| #5 | Number of accidents involving drivers whose reason for travel is unknown | Integer | 2029 |
| #6 | Drivers injured during occupational hours, proportion of total drivers whose reason for travel is known | Integer, percent | 1272, 45.058% |
| #7 | Drivers injured during non-occupational hours, proportion of total drivers whose reason for travel is known | Integer, percent | 1551, 54.942% |
| #8 | Drivers injured during occupational hours, proportion of total drivers whose reason for travel is unknown | Integer, percent | 1150, 52.368 % |
| #9 | Drivers injured during non-occupational hours, proportion of total drivers whose reason for travel is unknown | Integer, percent | 1046, 47.632 % |
| #10 | Number of independent variables assessed for inclusion in the classification model | Integer | 29 |
| #11 | Number of independent variables accepted for inclusion in subsequent classification models | Integer | 11 |
| #12 | Acceptance threshold for the positive class result | Float | 0.7 |
| #13 | In the Logistic Regression model, Sensitivity was higher than Specificity | Boolean | False |
| #14 | Accuracy, Sensitivity, and Specificity of the Logistic Regression model | Float, float, float | 0.6645, 0.6468, 0.6788 |
| #15 | In the XGBoost model, Sensitivity was higher than Specificity | Boolean | True |
| #16 | Accuracy, Sensitivity, and Specificity of the XGBoost model | Float, float, float | 0.6731, 0.6987, 0.6439 |
| #17 | In the unparameterized Decision Tree model, Sensitivity was higher than Specificity | Boolean | False |
| #18 | In the parameterized Decision Tree model set to run 99 cycles, Sensitivity was higher than Specificity | Boolean | False |
| #19 | In the Random Forest model, Sensitivity was higher than Specificity | Boolean | False |
| #20 | In the Pruned Tree model, Sensitivity was higher than Specificity | Boolean | False |
| #21 | Accuracy, Sensitivity, and Specificity of the Pruned Tree model | Float, float, float | 0.6525, 0.6084, 0.6910 |
In this third part of our task of classifying drivers injured in traffic accidents during 2023, we have focused on implementing in practice an imputation for the 2,196 drivers whose reason for travel was unknown. The relevance of this imputation lies in the fact that these drivers, for whom we had a complete set of valid data, accounted for almost 44% of all injured drivers (see row #4 in the Results Table). Such a high proportion made the reason-for-travel variable unreliable for analysis, but with the work done so far we consider that it can now be used as another variable to enrich the study of trends in traffic accidents with injured drivers attended by officers of the Barcelona Guàrdia Urbana.
It is also important to assess the variables accepted for their statistical significance in explaining the outcome, which in this case is to indicate “Yes”/“No” for each driver depending on whether the reason for their travel at the time of the documented accident was related to the world of work (see the details on the significance of this variable in the first part of this study). These variables matter not only because the corresponding algorithm assesses them as statistically relevant, but also because they are consistent with what we know of the world of work. It is therefore congruent that the type of vehicle, especially professional-use vehicles, or the fact that the accident took place on a working day and during occupational hours, are relevant for considering the travel as work-related. More intuitively, it also makes sense that age, or the month or season of the year in which the accident took place, have a clear analytical relevance. Along these lines, and also with a view to subsequent interpretation, it is interesting that the number of serious injuries, whether more than two victims or vehicles were involved, and the immediate cause of the accident also show statistically relevant relationships with a work-related travel reason, although this should not necessarily be understood as its cause. It is also of interest that neither biological sex nor whether any of the involved drivers had less than 5 years of experience proved to be statistically significant when classifying results with the logistic regression algorithm. The district where the accident took place has also been discarded, which suggests that the administrative divisions of the territory are not significant in explaining whether the driver was traveling for a work-related reason.
Regarding the XGBoost classification model finally chosen, the classification is robust because it does not derive from a single data split or from a randomly chosen parameterization. Instead, it is chosen because it achieves the best discrimination of the dependent variable ‘Es_mon_treball’ according to the indicator known as the Area Under the Curve (AUC), after 500 boosting iterations, each seeking to improve on the result of the previous one and applying weighting to the smaller class groups, thus reducing the bias already detected in the dataset in the second part. A feature that adds reliability to the result is that it is the only classification model which, when the proposed classification is verified, also shows a greater proportion of correct predictions for the positive outcome (“Si”, the reason for the travel is related to the world of work) than for the negative outcome.
head(dadesDesconegut__)
## Numero_expedient Descripcio_sexe Edat Descripció_tipus_persona
## 1 2023S000006 Home 49 Conductor
## 2 2023S000007 Home 27 Conductor
## 3 2023S000015 Home 51 Conductor
## 4 2023S000028 Home 49 Conductor
## 5 2023S000033 Home 42 Conductor
## 6 2023S000038 Dona 43 Conductor
## Descripcio_Lloc_atropellament_vianant
## 1 Desconegut
## 2 Desconegut
## 3 Desconegut
## 4 Desconegut
## 5 Desconegut
## 6 Desconegut
## Descripcio_Motiu_desplacament_conductor Desc_Tipus_vehicle_implicat
## 1 Es desconeix Motocicleta
## 2 Es desconeix Motocicleta
## 3 Es desconeix Furgoneta
## 4 Es desconeix Motocicleta
## 5 Es desconeix Motocicleta
## 6 Es desconeix Ciclomotor
## Tipus_vehicle_estandaritzat
## 1 Vehicles motoritzats de 2 rodes
## 2 Vehicles motoritzats de 2 rodes
## 3 Vehicles motoritzats de quatre rodes
## 4 Vehicles motoritzats de 2 rodes
## 5 Vehicles motoritzats de 2 rodes
## 6 Vehicles motoritzats de 2 rodes
## Descripcio_victimitzacio Victimitzacio_est
## 1 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## 2 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 3 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 4 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 5 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 6 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## Codi_districte Nom_districte Codi_barri Nom_barri
## 1 6 Gracia 31 la Vila de Gràcia
## 2 3 Sants-Montjuic 12 la Marina del Prat Vermell
## 3 10 Sant Marti 68 el Poblenou
## 4 2 Eixample 5 el Fort Pienc
## 5 2 Eixample 7 la Dreta de l'Eixample
## 6 2 Eixample 6 la Sagrada Família
## Codi_carrer Nom_carrer Num_postal.
## 1 206403 Gran de Gràcia / Gràcia 0072 0072
## 2 180 Ramon Albó 0093 0093
## 3 115603 Espronceda / Pallars 0121 0121
## 4 225500 Nàpols 0061X0081X
## 5 28305 Bailèn / Ausiàs Marc 0043 0043
## 6 191204 Padilla / Mallorca 0439B0439B
## Descripcio_dia_setmana NK_Any Mes_any Nom_mes Dia_mes Hora_dia
## 1 Diumenge 2023 1 Gener 1 17
## 2 Diumenge 2023 1 Gener 1 18
## 3 Dilluns 2023 1 Gener 2 10
## 4 Dilluns 2023 1 Gener 2 21
## 5 Dimarts 2023 1 Gener 3 13
## 6 Dimarts 2023 1 Gener 3 14
## Descripcio_torn Numero_morts Numero_lesionats_lleus Numero_lesionats_greus
## 1 Tarda 0 1 0
## 2 Tarda 0 1 0
## 3 Matí 0 3 0
## 4 Tarda 0 1 0
## 5 Matí 0 1 0
## 6 Tarda 0 1 0
## Numero_victimes Numero_vehicles_implicats
## 1 1 2
## 2 1 2
## 3 3 2
## 4 1 2
## 5 1 2
## 6 1 2
## Descripcio_causa_mediata Antiguitat_carnet_min
## 1 Avancament defectuos/improcedent 7
## 2 Manca precaucio incorporacio circulacio 7
## 3 Desobeir semafor 16
## 4 Canvi de carril sense precaucio 16
## 5 Gir indegut o sense precaucio 24
## 6 Desobeir semafor 1
## Vehicles motoritzats de 2 rodes implicats
## 1 1
## 2 2
## 3 0
## 4 1
## 5 2
## 6 0
## Vehicles motoritzats de quatre rodes implicats
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## Vehicles sense permís de conducció implicats
## 1 0
## 2 0
## 3 2
## 4 1
## 5 0
## 6 1
## Vehicles d'Ús Professional implicats Es_laborable Es_mon_treball
## 1 0 No No
## 2 0 No No
## 3 0 Si Si
## 4 0 Si No
## 5 0 Si Si
## 6 0 Si Si
## Es_atropellament Edat_CODIF
## 1 No Conductors d'entre 38 i 49 anys edat
## 2 No Conductors fins 29 anys edat
## 3 No Conductors majors de 49 anys edat
## 4 No Conductors d'entre 38 i 49 anys edat
## 5 No Conductors d'entre 38 i 49 anys edat
## 6 No Conductors d'entre 38 i 49 anys edat
## Interve_conductor_novell Lleus_CODIF Greus_CODIF Morts_CODIF Victimes_CODIF
## 1 No Si No No No
## 2 No Si No No No
## 3 No Si No No Si
## 4 No Si No No No
## 5 No Si No No No
## 6 Si Si No No No
## Vehicles_CODIF VM2R_CODIF VM4R_CODIF V_no_permis_CODIF VUP_CODIF
## 1 Si Si Si No No
## 2 Si Si No No No
## 3 Si No No Si No
## 4 Si Si No Si No
## 5 Si Si No No No
## 6 Si No No Si No
## Es_ocupacional Victimitzacio_est_ Estacio_any
## 1 No Amb assistencia sanitaria en lloc accident Hivern
## 2 No Hospitalitzacio fins a 24h Hivern
## 3 Si Hospitalitzacio fins a 24h Hivern
## 4 No Hospitalitzacio fins a 24h Hivern
## 5 Si Hospitalitzacio fins a 24h Hivern
## 6 Si Amb assistencia sanitaria en lloc accident Hivern
## Es_classificacio_ML
## 1 Si
## 2 Si
## 3 Si
## 4 Si
## 5 Si
## 6 Si
head(final)
## Numero_expedient Es_classificacio_ML Es_mon_treball
## 1 2023S000006 Si No
## 2 2023S000007 Si No
## 3 2023S000008 No Si
## 4 2023S000010 No No
## 5 2023S000011 No No
## 6 2023S000011 No No
## Descripcio_Motiu_desplacament_conductor Descripcio_sexe Edat
## 1 Es desconeix Home 49
## 2 Es desconeix Home 27
## 3 En missió Home 35
## 4 Altres activitats Home 42
## 5 Altres activitats Home 31
## 6 Altres activitats Dona 64
## Edat_CODIF Descripció_tipus_persona
## 1 Conductors d'entre 38 i 49 anys edat Conductor
## 2 Conductors fins 29 anys edat Conductor
## 3 Conductors entre 30 i 37 anys edat Conductor
## 4 Conductors d'entre 38 i 49 anys edat Conductor
## 5 Conductors entre 30 i 37 anys edat Conductor
## 6 Conductors majors de 49 anys edat Conductor
## Desc_Tipus_vehicle_implicat Tipus_vehicle_estandaritzat
## 1 Motocicleta Vehicles motoritzats de 2 rodes
## 2 Motocicleta Vehicles motoritzats de 2 rodes
## 3 Motocicleta Vehicles motoritzats de 2 rodes
## 4 Motocicleta Vehicles motoritzats de 2 rodes
## 5 Motocicleta Vehicles motoritzats de 2 rodes
## 6 Motocicleta Vehicles motoritzats de 2 rodes
## Descripcio_victimitzacio Victimitzacio_est
## 1 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## 2 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 3 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## 4 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## 5 Ferit lleu: Hospitalització fins a 24h Ferit lleu
## 6 Ferit lleu: Amb assistència sanitària en lloc d'accident Ferit lleu
## Victimitzacio_est_ NK_Any Mes_any Nom_mes Estacio_any
## 1 Amb assistencia sanitaria en lloc accident 2023 1 Gener Hivern
## 2 Hospitalitzacio fins a 24h 2023 1 Gener Hivern
## 3 Amb assistencia sanitaria en lloc accident 2023 1 Gener Hivern
## 4 Amb assistencia sanitaria en lloc accident 2023 1 Gener Hivern
## 5 Hospitalitzacio fins a 24h 2023 1 Gener Hivern
## 6 Amb assistencia sanitaria en lloc accident 2023 1 Gener Hivern
## Dia_mes Descripcio_dia_setmana Hora_dia Descripcio_torn Es_laborable
## 1 1 Diumenge 17 Tarda No
## 2 1 Diumenge 18 Tarda No
## 3 1 Diumenge 14 Tarda No
## 4 2 Dilluns 8 Matí Si
## 5 2 Dilluns 9 Matí Si
## 6 2 Dilluns 9 Matí Si
## Es_ocupacional Codi_districte Nom_districte Codi_barri
## 1 No 6 Gracia 31
## 2 No 3 Sants-Montjuic 12
## 3 No 6 Gracia 31
## 4 Si 5 Sarria-Sant Gervasi 26
## 5 Si 7 Horta-Guinardo 39
## 6 Si 7 Horta-Guinardo 39
## Nom_barri Codi_carrer
## 1 la Vila de Gràcia 206403
## 2 la Marina del Prat Vermell 180
## 3 la Vila de Gràcia 267000
## 4 Sant Gervasi - Galvany 194803
## 5 Sant Genís dels Agudells 295108
## 6 Sant Genís dels Agudells 295108
## Nom_carrer Num_postal.
## 1 Gran de Gràcia / Gràcia 0072 0072
## 2 Ramon Albó 0093 0093
## 3 Riera de Cassoles 0056 0058
## 4 Marc Aureli 0031 0033
## 5 Sant Cugat / Samaria 0002U0002U
## 6 Sant Cugat / Samaria 0002U0002U
## Numero_victimes Victimes_CODIF Numero_morts Morts_CODIF
## 1 1 No 0 No
## 2 1 No 0 No
## 3 1 No 0 No
## 4 1 No 0 No
## 5 3 Si 0 No
## 6 3 Si 0 No
## Numero_lesionats_greus Greus_CODIF Numero_lesionats_lleus Lleus_CODIF
## 1 0 No 1 Si
## 2 0 No 1 Si
## 3 0 No 1 Si
## 4 0 No 1 Si
## 5 0 No 3 Si
## 6 0 No 3 Si
## Numero_vehicles_implicats Vehicles_CODIF Vehicles d'Ús Professional implicats
## 1 2 Si 0
## 2 2 Si 0
## 3 2 Si 0
## 4 2 Si 0
## 5 3 Si 0
## 6 3 Si 0
## VUP_CODIF Vehicles motoritzats de 2 rodes implicats VM2R_CODIF
## 1 No 1 Si
## 2 No 2 Si
## 3 No 2 Si
## 4 No 2 Si
## 5 No 1 Si
## 6 No 1 Si
## Vehicles motoritzats de quatre rodes implicats VM4R_CODIF
## 1 1 Si
## 2 0 No
## 3 0 No
## 4 0 No
## 5 2 Si
## 6 2 Si
## Vehicles sense permís de conducció implicats V_no_permis_CODIF
## 1 0 No
## 2 0 No
## 3 0 No
## 4 0 No
## 5 0 No
## 6 0 No
## Descripcio_causa_mediata Antiguitat_carnet_min
## 1 Avancament defectuos/improcedent 7
## 2 Manca precaucio incorporacio circulacio 7
## 3 Manca atencio a la conduccio 4
## 4 Manca precaucio incorporacio circulacio 13
## 5 Gir indegut o sense precaucio 2
## 6 Gir indegut o sense precaucio 2
## Interve_conductor_novell
## 1 No
## 2 No
## 3 Si
## 4 No
## 5 Si
## 6 Si
## Descripcio_Lloc_atropellament_vianant
## 1 Desconegut
## 2 Desconegut
## 3 A la vorera / Andana
## 4 Desconegut
## 5 Desconegut
## 6 Desconegut
## Es_atropellament Longitud_WGS84 Latitud_WGS84
## 1 No 2.155459 41.39984
## 2 No 2.177056 41.42689
## 3 Si 2.150179 41.40522
## 4 No 2.142298 41.40195
## 5 No 2.138388 41.42207
## 6 No 2.138388 41.42207