1 Introduction and Data Presentation
2 Modeling
3 Model selection
- 3.1 Results Table
- 3.2 Rationale
4 Bibliography

1 Introduction and Data Presentation

This article builds on the work completed for the Data Mining course, taken during the second semester of the 2023-2024 academic year in the Master of Data Science program at the Universitat Oberta de Catalunya (UOC). The first and second parts are both available on the RPubs repository. Here I aim to improve the existing modeling and select the most suitable model for imputing the dependent variable “Es_mon_treball”. Finally, I will also prepare the data by adding geocoordinates so that they can be used in geospatial data analysis software. Since some of the tasks were already presented and discussed in the previous two parts, I will not repeat code sections that were detailed in those two articles.

In this article, we standardize, clean, and code the dataset of drivers whose reason for travel was recorded as “Es desconeix” (“Unknown”) in the case files opened by the Barcelona local police, the Guàrdia Urbana. Because this step largely repeats what was detailed in the first part, it is not described again here.

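# Drivers with a known travel reason: set aside rows whose district is unknown
maskDistDescon <- dades$Nom_districte == "Desconegut"
dadesDistDescon <- dades[maskDistDescon, ]
dades_ <- dades[!maskDistDescon, ]

# Drivers with an unknown travel reason: same filter
# (maskDistDescon and dadesDistDescon are overwritten here)
maskDistDescon <- dadesDesconegut$Nom_districte == "Desconegut"
dadesDistDescon <- dadesDesconegut[maskDistDescon, ]
dadesDesconegut_ <- dadesDesconegut[!maskDistDescon, ]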

nMotiuConegut <- nrow(dades_)
nMotiuDesconegut <- nrow(dadesDesconegut_)
nConductors <- nMotiuConegut + nMotiuDesconegut
propMotiuConegut <- round((nMotiuConegut / nConductors) * 100, 3)
propMotiuDesconegut <- round((nMotiuDesconegut / nConductors) * 100, 3)
  
print(glue("A total of {nConductors} drivers are considered, of whom {nMotiuConegut} ({propMotiuConegut}%) have a known reason for their travel, while for {nMotiuDesconegut} ({propMotiuDesconegut}%), the reason for their travel is unknown."))

## A total of 5019 drivers are considered, of whom 2823 (56.246%) have a known reason for their travel, while for 2196 (43.754%), the reason for their travel is unknown.

Unlike the coding performed in the second part for the variable “Es_ocupacional”, which identifies whether the accident occurred during a period with a high proportion of work-related travel on weekdays, I now update the coding with the results of the 2023 Weekday Mobility Survey (Enquesta de Mobilitat en dia feiner). This survey provides information on time slots and reasons for travel (INSTITUT METRÒPOLI, 2023: 49). In contrast to the conclusions of the aforementioned study, we find the number of work-related trips between 5:00 and 16:00 to be significant.

dades_["Es_ocupacional"] <- "No"
horaOcupacional <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
maskHorari <- dades_$Hora_dia %in% horaOcupacional
maskLaborable <- dades_$Es_laborable == "Si"
dades_$Es_ocupacional[maskHorari & maskLaborable] <- "Si"

aggEs_ocupacional <- aggregate(dades_$Numero_expedient,
                       by=dades_["Es_ocupacional"], FUN=length)

ggplot(aggEs_ocupacional, aes(x="", y=x, fill=Es_ocupacional)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 1.1. Stacked Bar Chart of Driver Count Based on
Whether the Accident Occurred During Occupational Hours and
the Reason for Travel is Known.")) + 
  xlab(glue("Reason for Travel is Known")) + 
  ylab("Count")

nAccidents <- length(unique(dades_$Numero_expedient))
nNo_Es_ocupacional <- aggEs_ocupacional[1, 2]
propNo_Es_ocupacional <- round((nNo_Es_ocupacional / nMotiuConegut) * 100, 3)
nSi_Es_ocupacional <- aggEs_ocupacional[2, 2]
propSi_Es_ocupacional <- round((nSi_Es_ocupacional / nMotiuConegut) * 100, 3)

print(glue("Of the subset of drivers for whom the reason for their travel is known, {nSi_Es_ocupacional} ({propSi_Es_ocupacional}%) were traveling during occupational hours, while {nNo_Es_ocupacional} ({propNo_Es_ocupacional}%) were not, with a total of {nAccidents} accidents recorded."))

## Of the subset of drivers for whom the reason for their travel is known, 1272 (45.058%) were traveling during occupational hours, while 1551 (54.942%) were not, with a total of 2561 accidents recorded.

dadesDesconegut_["Es_ocupacional"] <- "No"
horaOcupacional <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
maskHorari <- dadesDesconegut_$Hora_dia %in% horaOcupacional
maskLaborable <- dadesDesconegut_$Es_laborable == "Si"
dadesDesconegut_$Es_ocupacional[maskHorari & maskLaborable] <- "Si"

aggEs_ocupacional <- aggregate(dadesDesconegut_$Numero_expedient,
                       by=dadesDesconegut_["Es_ocupacional"], FUN=length)

ggplot(aggEs_ocupacional, aes(x="", y=x, fill=Es_ocupacional)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 1.2. Stacked Bar Chart of Driver Count Based on
Whether the Accident Occurred During Occupational Hours and
the Reason for Travel is Unknown.")) + 
  xlab(glue("Reason for Travel is Unknown")) + 
  ylab("Count")

nAccidents <- length(unique(dadesDesconegut_$Numero_expedient))
nNo_Es_ocupacional <- aggEs_ocupacional[1, 2]
propNo_Es_ocupacional <- round((nNo_Es_ocupacional / nMotiuDesconegut) * 100, 3)
nSi_Es_ocupacional <- aggEs_ocupacional[2, 2]
propSi_Es_ocupacional <- round((nSi_Es_ocupacional / nMotiuDesconegut) * 100, 3)

print(glue("For the drivers with an unknown travel reason, {nSi_Es_ocupacional} ({propSi_Es_ocupacional}%) were traveling during occupational hours, while {nNo_Es_ocupacional} ({propNo_Es_ocupacional}%) were not, with a total of {nAccidents} accidents recorded."))

## For the drivers with an unknown travel reason, 1150 (52.368%) were traveling during occupational hours, while 1046 (47.632%) were not, with a total of 2029 accidents recorded.
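Because the same coding is applied twice, once to dades_ and once to dadesDesconegut_, the step could be wrapped in a small function. The sketch below is only an illustration: the helper name add_es_ocupacional and the toy data frame are assumptions that do not appear in the original code, while the column names and the 5 to 16 hour window are taken from it.

```r
# Hypothetical helper (not part of the original code): flag rows that fall on
# a working day within the occupational hours used in this article (5 to 16).
add_es_ocupacional <- function(df, hores = 5:16) {
  en_horari <- df$Hora_dia %in% hores
  laborable <- df$Es_laborable == "Si"
  df$Es_ocupacional <- ifelse(en_horari & laborable, "Si", "No")
  df
}

# Toy data with the same column names as the article's data frames.
exemple <- data.frame(
  Hora_dia     = c(4, 5, 12, 16, 17, 9),
  Es_laborable = c("Si", "Si", "Si", "No", "Si", "Si"),
  stringsAsFactors = FALSE
)

exemple <- add_es_ocupacional(exemple)
print(exemple)
```

With such a helper, both subsets would go through exactly the same rule, which reduces the risk of the two copies of the code drifting apart.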

Figures 1.1 and 1.2 show that the balance reverses depending on whether the reason for travel is known: among drivers whose reason is known, those injured in accidents during an occupational time slot are the minority (45.058%), whereas among drivers whose reason is unknown they are the majority (52.368%).
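One way to check whether this reversal is more than sampling noise is a chi-squared test of independence on the four counts printed above. This check is not part of the original analysis; it is a minimal sketch that only reuses the published counts, and the object names comptes and prova are my own.

```r
# Counts copied from the two printed summaries above.
comptes <- matrix(
  c(1272, 1551,    # reason known:   occupational hours yes / no
    1150, 1046),   # reason unknown: occupational hours yes / no
  nrow = 2, byrow = TRUE,
  dimnames = list(Motiu = c("Conegut", "Desconegut"),
                  Es_ocupacional = c("Si", "No"))
)

# Row-wise percentages, comparable with the figures quoted in the text.
percentatges <- round(prop.table(comptes, margin = 1) * 100, 3)
print(percentatges)

# Pearson chi-squared test of independence
# (chisq.test applies the Yates continuity correction to 2x2 tables by default).
prova <- chisq.test(comptes)
print(prova)
```

A small p-value would suggest that the share of occupational-hour accidents really differs between the two subsets, which matters later when the model is used to impute the unknown cases.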

We also check for any relevant changes in the class sizes of the victimization description variable, as shown below. The territorial distribution needs no further check, since Figures 15.1 and 15.2 of the first part already verified that it was not affected by data cleaning:

dades_["Victimitzacio_est_"] <- ""
dades_$Victimitzacio_est_ <- sapply(strsplit(dades_$Descripcio_victimitzacio,
                                ": "), "[", 2)
maskMorts <- grepl("Mort", dades_$Descripcio_victimitzacio)
dades_$Victimitzacio_est_[maskMorts] <- "Mort"
dades_$Victimitzacio_est_ <- str_to_sentence(dades_$Victimitzacio_est_)

aggVictmitzacio <- aggregate(dades_$Numero_expedient,
                             by=dades_["Victimitzacio_est_"],
                             FUN=length)

ggplot(aggVictmitzacio, aes(x="", y=x, fill=Victimitzacio_est_)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 2.1. Stacked Bar Chart of Driver Count by Type of Victimization
for Those Whose Reason for Travel Was Known.")) + 
  xlab(glue("Reason for Travel is Known")) + 
  ylab("Count")

dadesDesconegut_["Victimitzacio_est_"] <- ""
dadesDesconegut_$Victimitzacio_est_ <- sapply(strsplit(dadesDesconegut_$Descripcio_victimitzacio,
                                ": "), "[", 2)
maskMorts <- grepl("Mort", dadesDesconegut_$Descripcio_victimitzacio)
dadesDesconegut_$Victimitzacio_est_[maskMorts] <- "Mort"
dadesDesconegut_$Victimitzacio_est_ <- str_to_sentence(dadesDesconegut_$Victimitzacio_est_)

aggVictmitzacio <- aggregate(dadesDesconegut_$Numero_expedient,
                             by=dadesDesconegut_["Victimitzacio_est_"],
                             FUN=length)

ggplot(aggVictmitzacio, aes(x="", y=x, fill=Victimitzacio_est_)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 2.2. Stacked Bar Chart of Driver Count by Type of Victimization
for Those Whose Reason for Travel Was Unknown.")) + 
  xlab(glue("Reason for Travel is Unknown")) + 
  ylab("Count")

Comparing Figures 2.1 and 2.2 shows that the group proportions are practically identical for drivers whose reason for travel was known and for those whose reason was unknown.
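The visual comparison can be complemented with a table of percentage shares per class. The sketch below uses made-up toy vectors and illustrative class labels, not the real data; the helper name quotes_classe is hypothetical. In practice the two inputs would be the Victimitzacio_est_ columns of dades_ and dadesDesconegut_.

```r
# Hypothetical helper: percentage share of each class, over a fixed set of levels
# so that both subsets report the same classes in the same order.
quotes_classe <- function(x, nivells) {
  taula <- table(factor(x, levels = nivells))
  setNames(round(as.vector(taula) / length(x) * 100, 1), nivells)
}

# Toy vectors (made-up values) standing in for the two subsets.
coneguts    <- c("Lleu", "Lleu", "Lleu", "Greu", "Mort",
                 "Lleu", "Lleu", "Greu", "Lleu", "Lleu")
desconeguts <- c("Lleu", "Lleu", "Greu", "Lleu", "Lleu")
nivells     <- c("Greu", "Lleu", "Mort")

comparacio <- rbind(
  Conegut    = quotes_classe(coneguts, nivells),
  Desconegut = quotes_classe(desconeguts, nivells)
)
print(comparacio)
```

Reading the two rows side by side makes it easier to judge whether the shares are "practically identical" than comparing two separate charts.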

We also code a new variable from the month in which the accident occurred, grouping the months into quarters that stand for the seasons of the year: Winter (“Hivern”) for January to March, Spring (“Primavera”) for April to June, Summer (“Estiu”) for July to September, and Autumn (“Tardor”) for October to December.

dades_['Estacio_any'] <- "Estiu"
dades_$Estacio_any[dades_$Mes_any <= 3] <- "Hivern"
dades_$Estacio_any[dades_$Mes_any > 3 & dades_$Mes_any <= 6] <- "Primavera"
dades_$Estacio_any[dades_$Mes_any > 9] <- 'Tardor'

aggEstacions <- aggregate(dades_$Numero_expedient,
                             by=dades_["Estacio_any"],
                             FUN=length)

ggplot(aggEstacions, aes(x="", y=x, fill=Estacio_any)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 3.1. Stacked Bar Chart of Driver Count by Season of the
Accident for Those Whose Reason for Travel Was Known.")) + 
  xlab(glue("Reason for Travel is Known")) + 
  ylab("Count")
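The same month-to-season mapping can be written in a single call with base R's cut(), which may be easier to audit than successive assignments. This is an alternative sketch, not the code used in the article; the vector mes is a toy input standing in for the Mes_any column.

```r
# Toy input: one value per calendar month.
mes <- 1:12

# Intervals are right-closed by default: (0,3], (3,6], (6,9], (9,12].
estacio <- cut(mes,
               breaks = c(0, 3, 6, 9, 12),
               labels = c("Hivern", "Primavera", "Estiu", "Tardor"))

print(data.frame(Mes_any = mes, Estacio_any = as.character(estacio)))
```

Note that cut() returns a factor whose levels follow the order of the labels, so a later conversion with as.factor() would not be needed for this column.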

dades_$Tipus_vehicle_estandaritzat <- gsub("d'", "",
                                          dades_$Tipus_vehicle_estandaritzat)
dades_$Victimitzacio_est_ <- gsub("d'", "",
                                  dades_$Victimitzacio_est_)

dades_$Descripcio_causa_mediata <- gsub("d'", "",
                                  dades_$Descripcio_causa_mediata)

dades_$Nom_mes <- gsub("ç", "s", dades_$Nom_mes)

dades_$Tipus_vehicle_estandaritzat <- 
  stri_trans_general(str = dades_$Tipus_vehicle_estandaritzat,
                   id = "Latin-ASCII")
dades_$Victimitzacio_est_ <- 
  stri_trans_general(str = dades_$Victimitzacio_est_,
                   id = "Latin-ASCII")
dades_$Descripcio_causa_mediata <- 
  stri_trans_general(str = dades_$Descripcio_causa_mediata,
                   id = "Latin-ASCII")
dades_$Nom_districte <- 
  stri_trans_general(str = dades_$Nom_districte,
                   id = "Latin-ASCII")

varsSup <- c("Descripcio_sexe", "Tipus_vehicle_estandaritzat",
               "Victimitzacio_est_", "Nom_mes", "Nom_districte",
             "Descripcio_causa_mediata", "Estacio_any",
              "Victimes_CODIF", "Numero_morts",
             "Numero_lesionats_lleus", "Numero_lesionats_greus",
             "Numero_victimes", "Numero_vehicles_implicats",
               "Vehicles_CODIF", "VM2R_CODIF", "VM4R_CODIF",
             "V_no_permis_CODIF", "VUP_CODIF", "Interve_conductor_novell",
             "Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
             "Lleus_CODIF", "Greus_CODIF", "Morts_CODIF", "Es_atropellament",
             "Es_mon_treball")

dadesSup <- dades_[varsSup]
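To make the effect of the cleaning steps above concrete, the following sketch applies them to three toy strings. The input values are assumptions chosen to resemble the category labels in the data; only the function calls mirror the original code.

```r
library(stringi)

# Toy labels (assumed for illustration).
causa     <- "Manca d'atenció a la conducció"
districte <- "Sarrià-Sant Gervasi"
mes       <- "Març"

# 1. Drop the elided preposition "d'".
causa <- gsub("d'", "", causa)

# 2. Transliterate accented characters to plain ASCII.
causa     <- stri_trans_general(str = causa, id = "Latin-ASCII")
districte <- stri_trans_general(str = districte, id = "Latin-ASCII")

# 3. For month names the article replaces "ç" with "s" instead.
mes <- gsub("ç", "s", mes)

print(c(causa, districte, mes))
```

The cleaned labels are what later appear as factor levels in the model output, which is why the coefficient names there carry no accents or apostrophes.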

dadesDesconegut_['Estacio_any'] <- "Estiu"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any <= 3] <- "Hivern"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any > 3 & dadesDesconegut_$Mes_any <= 6] <- "Primavera"
dadesDesconegut_$Estacio_any[dadesDesconegut_$Mes_any > 9] <- 'Tardor'

aggEstacions <- aggregate(dadesDesconegut_$Numero_expedient,
                             by=dadesDesconegut_["Estacio_any"],
                             FUN=length)

ggplot(aggEstacions, aes(x="", y=x, fill=Estacio_any)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  ggtitle(glue("Fig. 3.2. Stacked Bar Chart of Driver Count by Season of the
Accident for Those Whose Reason for Travel Was Unknown.")) + 
  xlab(glue("Reason for Travel is Unknown")) + 
  ylab("Count")

dadesDesconegut_$Tipus_vehicle_estandaritzat <- gsub("d'", "",
                                          dadesDesconegut_$Tipus_vehicle_estandaritzat)
dadesDesconegut_$Victimitzacio_est_ <- gsub("d'", "",
                                  dadesDesconegut_$Victimitzacio_est_)

dadesDesconegut_$Descripcio_causa_mediata <- gsub("d'", "",
                                  dadesDesconegut_$Descripcio_causa_mediata)

dadesDesconegut_$Nom_mes <- gsub("ç", "s", dadesDesconegut_$Nom_mes)

dadesDesconegut_$Tipus_vehicle_estandaritzat <- 
  stri_trans_general(str = dadesDesconegut_$Tipus_vehicle_estandaritzat,
                   id = "Latin-ASCII")
dadesDesconegut_$Victimitzacio_est_ <- 
  stri_trans_general(str = dadesDesconegut_$Victimitzacio_est_,
                   id = "Latin-ASCII")
dadesDesconegut_$Descripcio_causa_mediata <- 
  stri_trans_general(str = dadesDesconegut_$Descripcio_causa_mediata,
                   id = "Latin-ASCII")
dadesDesconegut_$Nom_districte <- 
  stri_trans_general(str = dadesDesconegut_$Nom_districte,
                   id = "Latin-ASCII")

varsSup <- c("Descripcio_sexe", "Tipus_vehicle_estandaritzat",
               "Victimitzacio_est_", "Nom_mes", "Nom_districte",
             "Descripcio_causa_mediata", "Estacio_any",
              "Victimes_CODIF", "Numero_morts",
             "Numero_lesionats_lleus", "Numero_lesionats_greus",
             "Numero_victimes", "Numero_vehicles_implicats",
               "Vehicles_CODIF", "VM2R_CODIF", "VM4R_CODIF",
             "V_no_permis_CODIF", "VUP_CODIF", "Interve_conductor_novell",
             "Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
             "Lleus_CODIF", "Greus_CODIF", "Morts_CODIF", "Es_atropellament",
             "Es_mon_treball")

dadesDesconegutSup <- dadesDesconegut_[varsSup]

Once again, there is no substantial difference in the distribution of groups between the two subsets: in both, the highest number of injured drivers occurs in spring. Among drivers whose reason for travel was unknown, however, the proportion for the winter season is slightly higher.

2 Modeling

In the conclusions of the second part of this series, we noted several shortcomings in the modeling and its accuracy, possibly because of the very high number of variables involved. For this reason, in this section we first build a classification model with the Logistic Regression algorithm to identify which of the 27 usable variables are statistically significant for predicting the dependent variable ‘Es_mon_treball’. Then, based on the results of this first algorithm, we construct several models with other algorithms and parameterizations: unpruned and pruned decision trees, random forest, and XGBoost.

2.1 Logistic Regression

From the principal component decomposition carried out in section 5 of the first part, we already know that the numerical variables, with the exception of ‘Numero_victimes’ and ‘Numero_lesionats_greus’, show little correlation with each other. In any case, we introduce them into the model both in their continuous version and in their encoded form. We then fit a Logistic Regression model to analyze the association between these variables and the dependent variable:

dadesSup["EMT_recodificat"] <- ifelse(dadesSup$Es_mon_treball=="Si", 1, 0)

cols_ <- colnames(dadesSup)
df_dtype_ <- as.data.frame(sapply(dadesSup, class))

# Convert text columns, and columns with at most two distinct values, to factors
for (col in cols_) {
  data_type <- paste(df_dtype_[col, 1])
  n_domini <- length(unique(dadesSup[, col]))

  if (data_type == "character" || n_domini <= 2) {
    dadesSup[, col] <- as.factor(dadesSup[, col])
  }
}

dadesSup$Numero_morts <- as.integer(dadesSup$Numero_morts)
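One detail is worth checking at this point. If Numero_morts takes at most two distinct values in this subset (an assumption; the source does not show its distribution), the loop above will have turned it into a factor, and as.integer() applied to a factor returns the internal level codes rather than the original numbers. The toy vector below sketches the difference; the name morts is hypothetical.

```r
# Toy column after a conversion to factor.
morts <- factor(c(0, 1, 0, 0))

codis  <- as.integer(morts)                 # level codes: 1 2 1 1
valors <- as.integer(as.character(morts))   # original values: 0 1 0 0

print(codis)
print(valors)
```

In a regression, a constant shift of a predictor changes only the intercept, not the slope, so the practical effect here would be limited; going through as.character() is still the safer habit when the original values matter.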

cols_ <- colnames(dadesDesconegutSup)
df_dtype_ <- as.data.frame(sapply(dadesDesconegutSup, class))

# Same conversion for the subset with an unknown travel reason
for (col in cols_) {
  data_type <- paste(df_dtype_[col, 1])
  n_domini <- length(unique(dadesDesconegutSup[, col]))

  if (data_type == "character" || n_domini <= 2) {
    dadesDesconegutSup[, col] <- as.factor(dadesDesconegutSup[, col])
  }
}

dadesDesconegutSup$Numero_morts <- as.integer(dadesDesconegutSup$Numero_morts)
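The model fitted next enters several quantities twice, once as a count and once as a derived indicator, and some counts may be exact sums of others (for example, a total of victims equal to the sum of the injury and death counts, which is an assumption about this dataset rather than something stated in the source). When one predictor is an exact linear combination of the others, glm() cannot estimate its coefficient separately and reports NA for it. The toy example below, with made-up variables x1, x2 and total, sketches that behaviour.

```r
set.seed(1)
n  <- 200
x1 <- rpois(n, 1)        # made-up count
x2 <- rpois(n, 0.3)      # second made-up count
total <- x1 + x2         # exact linear combination of x1 and x2
y  <- rbinom(n, 1, plogis(-0.5 + 0.4 * x1))

fit <- glm(y ~ x1 + x2 + total, family = binomial(link = "logit"))

# The coefficient of `total` is NA: it is aliased with x1 and x2.
print(coef(fit))

# The rank of the fit is lower than the number of coefficients requested.
print(c(coeficients = length(coef(fit)), rang = fit$rank))
```

In summary() output this situation is announced by a line of the form "Coefficients: (k not defined because of singularities)", and the affected rows show NA in every column.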

model <- glm(EMT_recodificat ~ Descripcio_sexe + 
               Edat_CODIF + Edat + Tipus_vehicle_estandaritzat +
               Victimitzacio_est_ +
               Descripcio_causa_mediata + Estacio_any +
               Numero_morts + Morts_CODIF +
               Numero_lesionats_lleus + Lleus_CODIF +
               Numero_lesionats_greus + Greus_CODIF +
               Numero_vehicles_implicats + Vehicles_CODIF +
               Numero_victimes + Victimes_CODIF +
               Nom_districte +  VM2R_CODIF + VM4R_CODIF + V_no_permis_CODIF +
               VUP_CODIF + Interve_conductor_novell + Es_atropellament +
               Es_laborable + Es_ocupacional,
                data=dadesSup, family=binomial(link=logit), na.action = NULL)

summary(model)

## 
## Call:
## glm(formula = EMT_recodificat ~ Descripcio_sexe + Edat_CODIF + 
##     Edat + Tipus_vehicle_estandaritzat + Victimitzacio_est_ + 
##     Descripcio_causa_mediata + Estacio_any + Numero_morts + Morts_CODIF + 
##     Numero_lesionats_lleus + Lleus_CODIF + Numero_lesionats_greus + 
##     Greus_CODIF + Numero_vehicles_implicats + Vehicles_CODIF + 
##     Numero_victimes + Victimes_CODIF + Nom_districte + VM2R_CODIF + 
##     VM4R_CODIF + V_no_permis_CODIF + VUP_CODIF + Interve_conductor_novell + 
##     Es_atropellament + Es_laborable + Es_ocupacional, family = binomial(link = logit), 
##     data = dadesSup, na.action = NULL)
## 
## Coefficients: (2 not defined because of singularities)
##                                                                   Estimate
## (Intercept)                                                      -3.141604
## Descripcio_sexeHome                                               0.057154
## Edat_CODIFConductors entre 30 i 37 anys edat                     -0.099422
## Edat_CODIFConductors fins 29 anys edat                           -0.595047
## Edat_CODIFConductors majors de 49 anys edat                       0.201110
## Edat                                                             -0.018628
## Tipus_vehicle_estandaritzatVehicles motoritzats de quatre rodes   0.167033
## Tipus_vehicle_estandaritzatVehicles sense permis de conduccio    -0.190535
## Tipus_vehicle_estandaritzatVehicles Us Professional               4.530266
## Victimitzacio_est_Hospitalitzacio fins a 24h                      0.159140
## Victimitzacio_est_Hospitalitzacio superior a 24h                  0.611889
## Victimitzacio_est_Mort                                           -0.896369
## Victimitzacio_est_Rebutja assistencia sanitaria                   0.284140
## Descripcio_causa_mediataCanvi de carril sense precaucio          -0.227889
## Descripcio_causa_mediataDesobeir altres senyals                  -0.402497
## Descripcio_causa_mediataDesobeir semafor                         -0.126338
## Descripcio_causa_mediataEnvair calcada contraria                 -1.181084
## Descripcio_causa_mediataFallada mecanica o avaria               -10.738271
## Descripcio_causa_mediataGir indegut o sense precaucio            -0.339264
## Descripcio_causa_mediataManca atencio a la conduccio             -0.277985
## Descripcio_causa_mediataManca precaucio efectuar marxa enrera    -0.285622
## Descripcio_causa_mediataManca precaucio incorporacio circulacio  -0.220823
## Descripcio_causa_mediataNo cedir la dreta                        -0.181772
## Descripcio_causa_mediataNo respectar distancies                  -0.497367
## Descripcio_causa_mediataNo respectat pas de vianants             -0.022101
## Estacio_anyHivern                                                 0.007894
## Estacio_anyPrimavera                                              0.233228
## Estacio_anyTardor                                                 0.038313
## Numero_morts                                                      1.657122
## Morts_CODIFSi                                                           NA
## Numero_lesionats_lleus                                           -0.020486
## Lleus_CODIFSi                                                     0.552134
## Numero_lesionats_greus                                            1.202418
## Greus_CODIFSi                                                    -1.876713
## Numero_vehicles_implicats                                        -0.033703
## Vehicles_CODIFSi                                                  0.799640
## Numero_victimes                                                         NA
## Victimes_CODIFSi                                                 -0.383877
## Nom_districteEixample                                             0.070432
## Nom_districteGracia                                               0.020947
## Nom_districteHorta-Guinardo                                      -0.344914
## Nom_districteLes Corts                                            0.450337
## Nom_districteNou Barris                                          -0.086124
## Nom_districteSant Andreu                                         -0.328615
## Nom_districteSant Marti                                           0.113460
## Nom_districteSants-Montjuic                                       0.345712
## Nom_districteSarria-Sant Gervasi                                 -0.011230
## VM2R_CODIFSi                                                      0.103620
## VM4R_CODIFSi                                                      0.128787
## V_no_permis_CODIFSi                                               0.058489
## VUP_CODIFSi                                                       0.347619
## Interve_conductor_novellSi                                       -0.031369
## Es_atropellamentSi                                                0.471575
## Es_laborableSi                                                    0.621579
## Es_ocupacionalSi                                                  0.872729
##                                                                 Std. Error
## (Intercept)                                                       1.549508
## Descripcio_sexeHome                                               0.092090
## Edat_CODIFConductors entre 30 i 37 anys edat                      0.152061
## Edat_CODIFConductors fins 29 anys edat                            0.210195
## Edat_CODIFConductors majors de 49 anys edat                       0.177586
## Edat                                                              0.009312
## Tipus_vehicle_estandaritzatVehicles motoritzats de quatre rodes   0.124465
## Tipus_vehicle_estandaritzatVehicles sense permis de conduccio     0.135211
## Tipus_vehicle_estandaritzatVehicles Us Professional               0.736989
## Victimitzacio_est_Hospitalitzacio fins a 24h                      0.101444
## Victimitzacio_est_Hospitalitzacio superior a 24h                  0.743511
## Victimitzacio_est_Mort                                            1.733621
## Victimitzacio_est_Rebutja assistencia sanitaria                   0.169696
## Descripcio_causa_mediataCanvi de carril sense precaucio           0.221669
## Descripcio_causa_mediataDesobeir altres senyals                   0.246703
## Descripcio_causa_mediataDesobeir semafor                          0.227335
## Descripcio_causa_mediataEnvair calcada contraria                  0.978785
## Descripcio_causa_mediataFallada mecanica o avaria               196.968101
## Descripcio_causa_mediataGir indegut o sense precaucio             0.220003
## Descripcio_causa_mediataManca atencio a la conduccio              0.206952
## Descripcio_causa_mediataManca precaucio efectuar marxa enrera     0.436302
## Descripcio_causa_mediataManca precaucio incorporacio circulacio   0.274093
## Descripcio_causa_mediataNo cedir la dreta                         0.371908
## Descripcio_causa_mediataNo respectar distancies                   0.213964
## Descripcio_causa_mediataNo respectat pas de vianants              0.607251
## Estacio_anyHivern                                                 0.121827
## Estacio_anyPrimavera                                              0.115624
## Estacio_anyTardor                                                 0.119853
## Numero_morts                                                      1.320503
## Morts_CODIFSi                                                           NA
## Numero_lesionats_lleus                                            0.073968
## Lleus_CODIFSi                                                     0.601838
## Numero_lesionats_greus                                            0.578543
## Greus_CODIFSi                                                     0.780793
## Numero_vehicles_implicats                                         0.090138
## Vehicles_CODIFSi                                                  0.283757
## Numero_victimes                                                         NA
## Victimes_CODIFSi                                                  0.140993
## Nom_districteEixample                                             0.220617
## Nom_districteGracia                                               0.284376
## Nom_districteHorta-Guinardo                                       0.262054
## Nom_districteLes Corts                                            0.257007
## Nom_districteNou Barris                                           0.278184
## Nom_districteSant Andreu                                          0.261826
## Nom_districteSant Marti                                           0.234503
## Nom_districteSants-Montjuic                                       0.242471
## Nom_districteSarria-Sant Gervasi                                  0.241102
## VM2R_CODIFSi                                                      0.128622
## VM4R_CODIFSi                                                      0.108049
## V_no_permis_CODIFSi                                               0.102270
## VUP_CODIFSi                                                       0.227924
## Interve_conductor_novellSi                                        0.096416
## Es_atropellamentSi                                                0.415103
## Es_laborableSi                                                    0.118900
## Es_ocupacionalSi                                                  0.095891
##                                                                 z value
## (Intercept)                                                      -2.027
## Descripcio_sexeHome                                               0.621
## Edat_CODIFConductors entre 30 i 37 anys edat                     -0.654
## Edat_CODIFConductors fins 29 anys edat                           -2.831
## Edat_CODIFConductors majors de 49 anys edat                       1.132
## Edat                                                             -2.000
## Tipus_vehicle_estandaritzatVehicles motoritzats de quatre rodes   1.342
## Tipus_vehicle_estandaritzatVehicles sense permis de conduccio    -1.409
## Tipus_vehicle_estandaritzatVehicles Us Professional               6.147
## Victimitzacio_est_Hospitalitzacio fins a 24h                      1.569
## Victimitzacio_est_Hospitalitzacio superior a 24h                  0.823
## Victimitzacio_est_Mort                                           -0.517
## Victimitzacio_est_Rebutja assistencia sanitaria                   1.674
## Descripcio_causa_mediataCanvi de carril sense precaucio          -1.028
## Descripcio_causa_mediataDesobeir altres senyals                  -1.632
## Descripcio_causa_mediataDesobeir semafor                         -0.556
## Descripcio_causa_mediataEnvair calcada contraria                 -1.207
## Descripcio_causa_mediataFallada mecanica o avaria                -0.055
## Descripcio_causa_mediataGir indegut o sense precaucio            -1.542
## Descripcio_causa_mediataManca atencio a la conduccio             -1.343
## Descripcio_causa_mediataManca precaucio efectuar marxa enrera    -0.655
## Descripcio_causa_mediataManca precaucio incorporacio circulacio  -0.806
## Descripcio_causa_mediataNo cedir la dreta                        -0.489
## Descripcio_causa_mediataNo respectar distancies                  -2.325
## Descripcio_causa_mediataNo respectat pas de vianants             -0.036
## Estacio_anyHivern                                                 0.065
## Estacio_anyPrimavera                                              2.017
## Estacio_anyTardor                                                 0.320
## Numero_morts                                                      1.255
## Morts_CODIFSi                                                        NA
## Numero_lesionats_lleus                                           -0.277
## Lleus_CODIFSi                                                     0.917
## Numero_lesionats_greus                                            2.078
## Greus_CODIFSi                                                    -2.404
## Numero_vehicles_implicats                                        -0.374
## Vehicles_CODIFSi                                                  2.818
## Numero_victimes                                                      NA
## Victimes_CODIFSi                                                 -2.723
## Nom_districteEixample                                             0.319
## Nom_districteGracia                                               0.074
## Nom_districteHorta-Guinardo                                      -1.316
## Nom_districteLes Corts                                            1.752
## Nom_districteNou Barris                                          -0.310
## Nom_districteSant Andreu                                         -1.255
## Nom_districteSant Marti                                           0.484
## Nom_districteSants-Montjuic                                       1.426
## Nom_districteSarria-Sant Gervasi                                 -0.047
## VM2R_CODIFSi                                                      0.806
## VM4R_CODIFSi                                                      1.192
## V_no_permis_CODIFSi                                               0.572
## VUP_CODIFSi                                                       1.525
## Interve_conductor_novellSi                                       -0.325
## Es_atropellamentSi                                                1.136
## Es_laborableSi                                                    5.228
## Es_ocupacionalSi                                                  9.101
##                                                                 Pr(>|z|)    
## (Intercept)                                                      0.04261 *  
## Descripcio_sexeHome                                              0.53484    
## Edat_CODIFConductors entre 30 i 37 anys edat                     0.51322    
## Edat_CODIFConductors fins 29 anys edat                           0.00464 ** 
## Edat_CODIFConductors majors de 49 anys edat                      0.25744    
## Edat                                                             0.04546 *  
## Tipus_vehicle_estandaritzatVehicles motoritzats de quatre rodes  0.17959    
## Tipus_vehicle_estandaritzatVehicles sense permis de conduccio    0.15878    
## Tipus_vehicle_estandaritzatVehicles Us Professional             7.90e-10 ***
## Victimitzacio_est_Hospitalitzacio fins a 24h                     0.11671    
## Victimitzacio_est_Hospitalitzacio superior a 24h                 0.41052    
## Victimitzacio_est_Mort                                           0.60512    
## Victimitzacio_est_Rebutja assistencia sanitaria                  0.09405 .  
## Descripcio_causa_mediataCanvi de carril sense precaucio          0.30392    
## Descripcio_causa_mediataDesobeir altres senyals                  0.10278    
## Descripcio_causa_mediataDesobeir semafor                         0.57839    
## Descripcio_causa_mediataEnvair calcada contraria                 0.22755    
## Descripcio_causa_mediataFallada mecanica o avaria                0.95652    
## Descripcio_causa_mediataGir indegut o sense precaucio            0.12305    
## Descripcio_causa_mediataManca atencio a la conduccio             0.17919    
## Descripcio_causa_mediataManca precaucio efectuar marxa enrera    0.51270    
## Descripcio_causa_mediataManca precaucio incorporacio circulacio  0.42044    
## Descripcio_causa_mediataNo cedir la dreta                        0.62501    
## Descripcio_causa_mediataNo respectar distancies                  0.02010 *  
## Descripcio_causa_mediataNo respectat pas de vianants             0.97097    
## Estacio_anyHivern                                                0.94834    
## Estacio_anyPrimavera                                             0.04368 *  
## Estacio_anyTardor                                                0.74922    
## Numero_morts                                                     0.20951    
## Morts_CODIFSi                                                         NA    
## Numero_lesionats_lleus                                           0.78182    
## Lleus_CODIFSi                                                    0.35893    
## Numero_lesionats_greus                                           0.03768 *  
## Greus_CODIFSi                                                    0.01623 *  
## Numero_vehicles_implicats                                        0.70847    
## Vehicles_CODIFSi                                                 0.00483 ** 
## Numero_victimes                                                       NA    
## Victimes_CODIFSi                                                 0.00648 ** 
## Nom_districteEixample                                            0.74953    
## Nom_districteGracia                                              0.94128    
## Nom_districteHorta-Guinardo                                      0.18811    
## Nom_districteLes Corts                                           0.07973 .  
## Nom_districteNou Barris                                          0.75687    
## Nom_districteSant Andreu                                         0.20945    
## Nom_districteSant Marti                                          0.62850    
## Nom_districteSants-Montjuic                                      0.15393    
## Nom_districteSarria-Sant Gervasi                                 0.96285    
## VM2R_CODIFSi                                                     0.42046    
## VM4R_CODIFSi                                                     0.23329    
## V_no_permis_CODIFSi                                              0.56739    
## VUP_CODIFSi                                                      0.12722    
## Interve_conductor_novellSi                                       0.74491    
## Es_atropellamentSi                                               0.25594    
## Es_laborableSi                                                  1.72e-07 ***
## Es_ocupacionalSi                                                 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3900.6  on 2822  degrees of freedom
## Residual deviance: 3432.0  on 2770  degrees of freedom
## AIC: 3538
## 
## Number of Fisher Scoring iterations: 10

Taking p < 0.05 in the Pr(>|z|) column as the threshold for statistical significance, only 11 of the variables entered into the logistic regression are retained as potentially explanatory: 2 continuous numerical variables and 9 categorical variables, detailed in the following table:

Variable Name	Data Type	Variable Description
“Tipus_vehicle_estandaritzat”	Categorical	Standardization of the ‘Tipus_vehicle’ variable into just four categories: ‘Professional use vehicles’ (“Vehicles d’ús professional”), ‘Four-wheeled motor vehicles’ (“Vehicles motoritzats de quatre rodes”), ‘Two-wheeled motor vehicles’ (“Vehicles motoritzats de dues rodes”), and ‘Vehicles without a driving license’ (“Vehicles sense permís de conducció”).
“Descripcio_causa_mediata”	Categorical	Details the immediate cause identified by the corresponding Guàrdia Urbana patrol that filed the accident report, referring to the type of maneuver or immediate circumstance that caused the accident.
“Numero_lesionats_greus”	Continuous numerical	Number of people injured in the accident who required hospitalization for more than 24 hours.
“Greus_CODIF”	Categorical	Determines whether there was one or more seriously injured individuals in the accident.
“Nom_mes”	Categorical	Month of the year in which the accident occurred.
“Victimes_CODIF”	Categorical	Determines whether there were two or more victims in the accident.
“Vehicles_CODIF”	Categorical	Determines whether there were two or more vehicles involved in the accident.
“Edat”	Continuous numerical	Complete years of age of the driver on the date of the accident.
“Edat_CODIF”	Categorical	Age group coding based on interquartile ranges: under 29 years, from 29 to 37 years, from 37 to 49 years, and over 49 years.
“Es_laborable”	Categorical	Determines if the date of the accident was a working day during the year 2023.
“Es_ocupacional”	Categorical	Determines if the time of the accident was during occupational hours, from 5 AM to 4 PM, on a working day.
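As a minimal sketch of how such a p-value filter can be applied programmatically, the following example fits a logistic regression on simulated data (not the accident dataset) and keeps the terms whose Pr(>|z|) falls below 0.05. The names `x1`, `x2`, `alpha`, and `keep` are illustrative assumptions and do not appear in the original analysis.

```r
# Simulated data (hypothetical; not the accident dataset)
set.seed(1)
n  <- 500
x1 <- rnorm(n)                      # predictor with a real effect
x2 <- rnorm(n)                      # pure noise predictor
y  <- rbinom(n, 1, plogis(2 * x1))  # outcome depends on x1 only

fit <- glm(y ~ x1 + x2, family = binomial)

# coef(summary(fit)) is a matrix whose last column is "Pr(>|z|)"
coefs <- coef(summary(fit))
alpha <- 0.05
keep  <- rownames(coefs)[coefs[, "Pr(>|z|)"] < alpha]
keep  <- setdiff(keep, "(Intercept)")   # the intercept is not a variable
print(keep)
```

For a categorical variable, each level has its own row in the coefficient matrix, so some rule is still needed to decide when the variable as a whole is retained.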

We also check the accuracy of this model on the same data on which it was trained, together with its sensitivity (the proportion of work-related trips that it correctly identifies) and its specificity (the proportion of non-work-related trips that it correctly identifies):

pred <- predict(model, newdata=dadesSup, type='response')

dadesSup$prediction <- ifelse(pred < 0.5 ,0, 1)
prediccio <- as.factor(dadesSup$prediction)
valor_esperat <- dadesSup$EMT_recodificat

tb <- table(valor_esperat, prediccio); tb

##              prediccio
## valor_esperat    0    1
##             0 1061  446
##             1  508  808

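# table() is stored column by column, with rows = observed and columns = predicted:
# TN: tb[1]  (observed 0, predicted 0)
# FN: tb[2]  (observed 1, predicted 0)
# FP: tb[3]  (observed 0, predicted 1)
# TP: tb[4]  (observed 1, predicted 1)

# accuracy
precisio <- (tb[4] + tb[1]) / (tb[1] + tb[2] + tb[3] + tb[4])
print(glue("\n\nThe accuracy of the prediction is {round(precisio, 4)}."))

## 
## The accuracy of the prediction is 0.6621.

# reported as sensitivity; tb[3] holds the false positives, so this ratio
# is TP / (TP + FP); TP / (TP + FN) would be tb[4] / (tb[4] + tb[2])
sensitivitat <- tb[4] / (tb[4] + tb[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))

## 
## The sensitivity of the prediction is 0.6443.

# reported as specificity; tb[2] holds the false negatives, so this ratio
# is TN / (TN + FN); TN / (TN + FP) would be tb[1] / (tb[1] + tb[3])
especificitat <- tb[1] / (tb[1] + tb[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))

## 
## The specificity of the prediction is 0.6762.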
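Because single-index access such as `tb[2]` depends on R storing tables column by column, a version with explicit row and column names may be easier to audit. The sketch below rebuilds the confusion matrix printed above from its four counts and derives the usual ratios by name; `metrics`, `ppv`, and `npv` are illustrative names, not part of the original code.

```r
# Counts copied from the confusion matrix printed above
# (rows = observed class, columns = predicted class)
tb <- matrix(c(1061, 508, 446, 808), nrow = 2,
             dimnames = list(valor_esperat = c("0", "1"),
                             prediccio     = c("0", "1")))

TN <- tb["0", "0"]; FP <- tb["0", "1"]
FN <- tb["1", "0"]; TP <- tb["1", "1"]

metrics <- c(
  accuracy    = (TP + TN) / sum(tb),
  sensitivity = TP / (TP + FN),   # share of observed positives recovered
  specificity = TN / (TN + FP),   # share of observed negatives recovered
  ppv         = TP / (TP + FP),   # share of predicted positives that are right
  npv         = TN / (TN + FN)    # share of predicted negatives that are right
)
print(round(metrics, 4))
```

With this layout, the two ratios printed above (0.6443 and 0.6762) coincide with `ppv` and `npv`, while sensitivity and specificity computed from the same counts come out at about 0.614 and 0.704.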

The model shows moderate predictive capacity, with specificity higher than sensitivity. These figures, however, were obtained on the same data used to fit the model, so they are likely to be optimistic and cannot rule out overfitting. For this reason, and because the purpose of this logistic regression was only to select, on the basis of statistical significance, the variables with potential predictive capacity, we will discard it as a classification model.

Variables whose coefficients are reported as NA (here ‘Morts_CODIF’ and ‘Numero_victimes’) could not be estimated, which typically means that they are linearly dependent on other variables or classes in the model (perfect multicollinearity); we will treat them as not being explanatory either.
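The following toy example, on simulated data, illustrates how a predictor that is an exact linear function of another one ends up with an NA coefficient in `glm()`. The variables `x1` and `x2` are hypothetical and only serve to reproduce the behaviour.

```r
# Simulated data (hypothetical): x2 carries no information beyond x1
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 2 * x1                          # exact linear function of x1
y  <- rbinom(n, 1, plogis(x1))

fit <- glm(y ~ x1 + x2, family = binomial)
print(coef(fit))                      # the coefficient for x2 is NA

# Names of the terms that could not be estimated
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```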

We also observe that four of the variables form two pairs in which one derives from the other: ‘Numero_lesionats_greus’ and ‘Greus_CODIF’, and ‘Edat’ and ‘Edat_CODIF’. All four will be considered in the following modeling with the XGBoost algorithm, but for the decision trees only the two categorical variables ‘Greus_CODIF’ and ‘Edat_CODIF’ can be considered.

2.2 XGBoost (eXtreme Gradient Boosting)

We implement the XGBoost algorithm following the boosting example described by P. Bruce, A. Bruce, and P. Gedeck (BRUCE; BRUCE & GEDECK, 2022: 263-265), as proposed in the conclusions of the second part. Following the detailed example by T. Pham on RPubs, the workflow tunes boosted ensembles of 500 trees over a set of randomly generated hyperparameter combinations, evaluates each combination on cross-validation folds of the training dataset, and automates the selection of the best-performing model. We then use that model to classify the drivers involved in accidents whose reason for travel is unknown.

We first build the model with the 11 variables that, based on the analysis in the previous section, would be explanatory for the dependent variable ‘Es_mon_treball’; we then check which subset of the mutually derived variables gives the best classification on the test subset created during the workflow.
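The tuning grid used in the code that follows is a Latin hypercube. As a rough base-R illustration of the sampling idea only (this is not the dials implementation), each parameter's range is cut into as many strata as there are candidates, and every stratum is visited exactly once per parameter. The function `latin_hypercube()` and the parameter ranges are assumptions made for this sketch.

```r
# Hypothetical helper: n candidate points for k parameters on the unit cube
latin_hypercube <- function(n, k) {
  # one column per parameter; each column places one point in each of n strata
  sapply(seq_len(k), function(j) (sample(n) - runif(n)) / n)
}

set.seed(3)
lhs <- latin_hypercube(n = 20, k = 2)

# Rescale the unit-interval columns to illustrative parameter ranges
grid <- data.frame(
  tree_depth = round(1 + lhs[, 1] * 14),    # integers between 1 and 15
  learn_rate = 10^(-10 + lhs[, 2] * 9)      # log-uniform between 1e-10 and 1e-1
)
head(grid)
```

Compared with purely random draws, this spreads the 20 candidates evenly along each parameter, which is why very small and very large learning rates both appear among the candidates.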

First, we will consider all 11 variables:

data.xgb <- dadesSup[, c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes", 
                       "Numero_lesionats_greus",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Edat", "Es_laborable", "Es_ocupacional",
                     "Es_mon_treball")]

set.seed(123)
cust_split <- data.xgb %>%
  initial_split(prop = 0.8, strata = Es_mon_treball)

train <- training(cust_split)
test <- testing(cust_split)

# Cross validation folds from training dataset
set.seed(234)
folds <- vfold_cv(train, strata = Es_mon_treball)

# Preprocessing recipe. Note that the workflow below is built with add_formula(),
# so these steps are defined here but not applied to the fitted models.
cust_rec <- recipe(Es_mon_treball ~ ., data = train) %>%
  step_corr(all_numeric(), threshold = 0.7, method = "spearman") %>%
  step_zv(all_numeric()) %>% # filter zero variance
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

# Setup a model specification
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(), 
  min_n = tune(),
  loss_reduction = tune(),                    ## first three: model complexity
  sample_size = tune(), mtry = tune(),        ## randomness
  learn_rate = tune()                         ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow() %>%
  add_formula(Es_mon_treball ~.) %>%
  add_model(xgb_spec)

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = 20
)

doParallel::registerDoParallel()

set.seed(234)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred  = TRUE)
)

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               names_to = "parameter",
               values_to = "value") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE)+
  facet_wrap(~parameter, scales = "free_x")

show_best(xgb_res)

## # A tibble: 5 × 12
##    mtry min_n tree_depth    learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>         <dbl>          <dbl>       <dbl> <chr>  
## 1     9     4          7 0.00000000628     0.0985           0.690 roc_auc
## 2     7     3          5 0.0000366         0.000384         0.386 roc_auc
## 3     8    14          2 0.0500            0.0000364        0.492 roc_auc
## 4    10     7          4 0.0224            2.45             0.136 roc_auc
## 5    11    18          6 0.000269          0.00000428       0.902 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## #   .config <chr>

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs1 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs1 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.671 Preprocessor1_Model1
## 2 sens     binary         0.725 Preprocessor1_Model1
## 3 spec     binary         0.610 Preprocessor1_Model1
## 4 roc_auc  binary         0.708 Preprocessor1_Model1

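Second, we build the model with 9 variables, excluding ‘Greus_CODIF’ and ‘Edat_CODIF’. As in the remaining variants, only the final fitting step is shown; the preceding steps are the same as for the first model, applied to the reduced set of columns: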

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs2 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs2 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.673 Preprocessor1_Model1
## 2 sens     binary         0.699 Preprocessor1_Model1
## 3 spec     binary         0.644 Preprocessor1_Model1
## 4 roc_auc  binary         0.710 Preprocessor1_Model1

Third, we build the model with 10 variables, excluding the variable ‘Numero_lesionats_greus’:

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs3 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs3 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.673 Preprocessor1_Model1
## 2 sens     binary         0.699 Preprocessor1_Model1
## 3 spec     binary         0.644 Preprocessor1_Model1
## 4 roc_auc  binary         0.710 Preprocessor1_Model1

Fourth, we build the model with 9 variables, excluding the variables ‘Numero_lesionats_greus’ and ‘Edat’:

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs4 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs4 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.666 Preprocessor1_Model1
## 2 sens     binary         0.679 Preprocessor1_Model1
## 3 spec     binary         0.652 Preprocessor1_Model1
## 4 roc_auc  binary         0.702 Preprocessor1_Model1

Fifth, we build the model with 10 variables, excluding the variable ‘Greus_CODIF’:

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs5 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs5 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.668 Preprocessor1_Model1
## 2 sens     binary         0.712 Preprocessor1_Model1
## 3 spec     binary         0.617 Preprocessor1_Model1
## 4 roc_auc  binary         0.705 Preprocessor1_Model1

Sixth, we build the model with 10 variables, excluding the variable ‘Edat’:

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs6 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs6 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.666 Preprocessor1_Model1
## 2 sens     binary         0.679 Preprocessor1_Model1
## 3 spec     binary         0.652 Preprocessor1_Model1
## 4 roc_auc  binary         0.699 Preprocessor1_Model1

Finally, we build the model with 10 variables, excluding the variable ‘Edat_CODIF’:

best_auc <- select_best(xgb_res)
final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_rs7 <- last_fit(final_xgb, cust_split,
                     metrics = metric_set(accuracy, roc_auc, sens,spec))
final_rs7 %>%
  collect_metrics()

## # A tibble: 4 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.670 Preprocessor1_Model1
## 2 sens     binary         0.699 Preprocessor1_Model1
## 3 spec     binary         0.636 Preprocessor1_Model1
## 4 roc_auc  binary         0.712 Preprocessor1_Model1

Next, we summarize the results in the table below, which includes not only accuracy, sensitivity, and specificity but also the area under the receiver operating characteristic curve (AUROC), which can be read as the probability that the model assigns a higher score to a randomly chosen positive case than to a randomly chosen negative one:

Model	Excluded Variables	Accuracy	Sensitivity	Specificity	AUROC
Model #1	None	0.6714	0.7252	0.6098	0.7077
Model #2	“Greus_CODIF”, “Edat_CODIF”	0.6731	0.6987	0.6439	0.7099
Model #3	“Numero_lesionats_greus”	0.6731	0.6987	0.6439	0.7104
Model #4	“Numero_lesionats_greus”, “Edat”	0.6661	0.6788	0.6515	0.7016
Model #5	“Greus_CODIF”	0.6678	0.7119	0.6174	0.7049
Model #6	“Edat”	0.6661	0.6788	0.6515	0.6994
Model #7	“Edat_CODIF”	0.6696	0.6987	0.6364	0.7120
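As a toy illustration of what the AUROC column measures, the sketch below computes the area under the ROC curve with the rank-based (Mann-Whitney) formula in base R. The function name `auc_rank()` and the example scores are assumptions made for this sketch; the article itself obtains the value from `roc_auc` in the tidymodels workflow.

```r
# Hypothetical helper: AUROC from scores and 0/1 labels via ranks
auc_rank <- function(score, label) {
  r  <- rank(score)            # average ranks, so ties count as one half
  n1 <- sum(label == 1)        # number of positive cases
  n0 <- sum(label == 0)        # number of negative cases
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Four cases: three of the four positive/negative pairs are ordered correctly
auc_toy <- auc_rank(score = c(0.1, 0.4, 0.35, 0.8), label = c(0, 0, 1, 1))
print(auc_toy)   # 0.75
```

A value of 0.5 corresponds to ranking positives and negatives at random and 1 to a perfect ranking, which places the values of about 0.70 to 0.71 in the table in context.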

In total, seven classification models have been built, differing in whether they use the two continuous numerical variables ‘Edat’ and ‘Numero_lesionats_greus’ and their corresponding coded variables ‘Edat_CODIF’ and ‘Greus_CODIF’. All seven show similar accuracy (0.666 to 0.673) and AUROC (0.699 to 0.712); the largest differences are observed in sensitivity and specificity. Models #2 and #3 are the exception: they have identical accuracy, sensitivity, and specificity and differ only marginally in AUROC, even though Model #3 excludes a single numerical variable while Model #2 excludes the two coded categorical variables. In all cases, sensitivity is higher than specificity.
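The seven configurations can be written down compactly as a list of exclusions applied to the full set of 11 variables. This is only a sketch of the bookkeeping, with `all_vars`, `excluded`, and `subsets` as illustrative names; how each subset was actually assembled in the original workflow is not shown in the article.

```r
# The 11 candidate predictors, as named in the article
all_vars <- c("Tipus_vehicle_estandaritzat", "Descripcio_causa_mediata",
              "Nom_mes", "Numero_lesionats_greus", "Greus_CODIF",
              "Victimes_CODIF", "Vehicles_CODIF", "Edat_CODIF", "Edat",
              "Es_laborable", "Es_ocupacional")

# Variables dropped in each of the seven models
excluded <- list(
  model1 = character(0),
  model2 = c("Greus_CODIF", "Edat_CODIF"),
  model3 = "Numero_lesionats_greus",
  model4 = c("Numero_lesionats_greus", "Edat"),
  model5 = "Greus_CODIF",
  model6 = "Edat",
  model7 = "Edat_CODIF"
)

# Predictor set for each model
subsets <- lapply(excluded, function(ex) setdiff(all_vars, ex))
n_vars  <- sapply(subsets, length)
print(n_vars)
```

Each element of `subsets` could then be combined with the outcome column ‘Es_mon_treball’ and passed through the same split, tuning, and final fit steps.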

Finally, we choose Model #2 because it combines a high proportion of correct positive and negative predictions and because it is easier to explain, since it does not use the same variable twice, in both its numerical and its categorical form.

final_wf <- final_rs2 %>%
  extract_workflow()

desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes", 
                       "Numero_lesionats_greus",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat", "Es_laborable", "Es_ocupacional")]

prediction.xgb <- predict(final_wf, desconegutsPred)
table(prediction.xgb)

## .pred_class
##   No   Si 
## 1113 1083

In the classification results of the dataset of drivers whose reason for travel is unknown, we observe that both classes have been assigned almost equally.

2.3 Decision Tree

Next, we build a classification model using the decision tree algorithm from the C50 package in R, without any pruning or changes to its default parameters. First, we fit it including the variable ‘Nom_mes’:

dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Es_laborable", "Es_ocupacional")

y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides
x <- dummy[colsSup]

split_prop <- 5 
indexes = sample(1:nrow(dummy),
                 size=floor(((split_prop-1)/split_prop)*nrow(dummy)))
train_x <- x[indexes, ]
train_y <- y[indexes]
test_x <- x[-indexes, ]
test_y <- y[-indexes]

model1 <- C50::C5.0(train_x, train_y, rules=TRUE)
summary(model1)

## 
## Call:
## C5.0.default(x = train_x, y = train_y, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Sep  5 17:19:23 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 2258 cases (10 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (1217/386, lift 1.3)
##  Tipus_vehicle_estandaritzat in {Vehicles motoritzats de 2 rodes,
##                                         Vehicles motoritzats de quatre rodes,
##                                         Vehicles sense permis de conduccio}
##  Es_ocupacional = No
##  ->  class No  [0.683]
## 
## Rule 2: (61/1, lift 2.1)
##  Tipus_vehicle_estandaritzat = Vehicles Us Professional
##  ->  class Si  [0.968]
## 
## Rule 3: (999/391, lift 1.3)
##  Es_ocupacional = Si
##  ->  class Si  [0.608]
## 
## Default class: No
## 
## 
## Evaluation on training data (2258 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       3  778(34.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     831   392    (a): class No
##     386   649    (b): class Si
## 
## 
##  Attribute usage:
## 
##   98.14% Es_ocupacional
##   56.60% Tipus_vehicle_estandaritzat
## 
## 
## Time: 0.0 secs

model1 <- C50::C5.0(train_x, train_y)
plot(model1, type="s", title="Fig. 4. Unpruned decision tree.")

# Note: 'threshold' does not appear to be a documented argument of predict() for
# C5.0 models; if it is absorbed by '...', it has no effect on the predicted classes.
predicted_model1 <- predict(model1, test_x, type="class",
                            threshold=0.7)

print(sprintf("The accuracy of the unpruned tree is %.4f %%.",
              100*sum(predicted_model1 == test_y) / length(predicted_model1)))

## [1] "The accuracy of the unpruned tree is 65.1327 %."

mat_conf <- table(test_y, Predicted=predicted_model1); mat_conf

##       Predicted
## test_y  No  Si
##     No 183 101
##     Si  96 185

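# table() is stored column by column, with rows = observed and columns = predicted:
# TN: mat_conf[1]  (observed No, predicted No)
# FN: mat_conf[2]  (observed Si, predicted No)
# FP: mat_conf[3]  (observed No, predicted Si)
# TP: mat_conf[4]  (observed Si, predicted Si)

# reported as sensitivity; mat_conf[3] holds the false positives, so this ratio
# is TP / (TP + FP); TP / (TP + FN) would be mat_conf[4] / (mat_conf[4] + mat_conf[2])
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))

## 
## The sensitivity of the prediction is 0.6469.

# reported as specificity; mat_conf[2] holds the false negatives, so this ratio
# is TN / (TN + FN); TN / (TN + FP) would be mat_conf[1] / (mat_conf[1] + mat_conf[3])
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))

## 
## The specificity of the prediction is 0.6559.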

In this case, the unpruned decision tree is notably more intelligible than the decision tree models built in the second part: the model relies primarily on the variables ‘Tipus_vehicle_estandaritzat’ and ‘Es_ocupacional’ for classification. However, we also note that sensitivity remains lower than specificity.

It should also be remembered that the decision tree implemented here suffers from a lack of robustness in its results. As we already observed in the second part, this dataset contains very small class groups which, when the training dataset is created by random selection, can be severely underrepresented or even omitted entirely. As a result, the results of one iteration can differ from those of the next, which makes the model's reliability questionable. Finally, since the algorithm prioritizes the variables with the greatest explanatory capacity, namely vehicle type and whether the accident occurred during occupational hours, the other seven categorical variables have virtually no effect.
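One way to gauge this lack of robustness is to repeat the random split several times and look at the spread of the resulting accuracy. The sketch below does so on simulated data with a deliberately rare factor level, using `rpart` in place of C5.0 purely to keep the example light; the data frame `d`, its columns, and the number of repetitions are assumptions, not the article's setup.

```r
library(rpart)

# Simulated data (hypothetical): one informative predictor and one factor
# with a rare level that a random split may leave almost empty
set.seed(4)
n <- 600
d <- data.frame(
  hours   = factor(sample(c("No", "Si"), n, replace = TRUE)),
  vehicle = factor(sample(c("A", "B", "C", "rare"), n, replace = TRUE,
                          prob = c(0.40, 0.30, 0.28, 0.02)))
)
p_work <- ifelse(d$hours == "Si", 0.65, 0.35)
d$work <- factor(ifelse(runif(n) < p_work, "Si", "No"))

# Repeat the 80/20 split with 20 different seeds and record test accuracy
acc <- sapply(1:20, function(i) {
  set.seed(100 + i)
  idx  <- sample(nrow(d), size = floor(0.8 * nrow(d)))
  fit  <- rpart(work ~ hours + vehicle, data = d[idx, ], method = "class")
  pred <- predict(fit, d[-idx, ], type = "class")
  mean(pred == d$work[-idx])
})

print(round(c(min = min(acc), mean = mean(acc), max = max(acc)), 3))
```

A wide gap between the minimum and maximum would indicate that a single split is not enough to judge the model; fixing the seed, or reporting the mean over several splits, are two common ways to make the figures reproducible.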

desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Es_laborable", "Es_ocupacional")]

predictions <- predict(model1, desconegutsPred, type="class", threshold=0.7)

table(predictions)

## predictions
##   No   Si 
## 1043 1153

On the other hand, it is noteworthy that the proportion of negative and positive classifications is virtually identical to the proportion of positive and negative classes of the ‘Es_ocupacional’ variable for drivers whose reason for travel is unknown (see Fig. 1.2 above). This suggests that the model is practically using only the ‘Es_ocupacional’ variable to make this classification, possibly without successfully classifying drivers who traveled on non-working days or during non-occupational hours.
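Comparing marginal proportions only suggests that the classifier mirrors a single variable; a cross-tabulation of the two columns checks it case by case. The vectors below are toy data standing in for the predicted class and the ‘Es_ocupacional’ column of the drivers with an unknown reason for travel; `pred`, `ocupac`, and `agreement` are illustrative names.

```r
# Toy vectors (hypothetical), both coded No / Si
pred   <- factor(c("Si", "Si", "No", "No", "Si", "No", "Si", "No"),
                 levels = c("No", "Si"))
ocupac <- factor(c("Si", "Si", "No", "No", "Si", "No", "No", "No"),
                 levels = c("No", "Si"))

# Rows: value of the single variable; columns: predicted class
tab <- table(Es_ocupacional = ocupac, predicted = pred)
print(tab)

# Share of cases in which the prediction simply repeats the variable
agreement <- sum(diag(tab)) / sum(tab)
print(agreement)   # 0.875
```

An agreement close to 1 on the real data would support the reading that the tree is essentially reproducing ‘Es_ocupacional’.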

2.4 Decision Tree with Alternative Parameterization

We now fit the same decision tree with boosting, allowing up to 99 boosting iterations (the `trials` argument) so that the final prediction combines the trees built in each iteration:

nTrials <- 99

model_nTrials <- C50::C5.0(train_x, train_y, trials = nTrials)

predicted_modelnTrials <- predict(model_nTrials, test_x, type="class")

print(sprintf("The accuracy of the tree with 99 iterations is %.4f %%.",
              100*sum(predicted_modelnTrials == test_y) / length(predicted_modelnTrials)))

## [1] "The accuracy of the tree with 99 iterations is 64.6018 %."

mat_conf <- table(test_y, Predicted=predicted_modelnTrials); mat_conf

##       Predicted
## test_y  No  Si
##     No 199  85
##     Si 115 166

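# table() is stored column by column, with rows = observed and columns = predicted:
# TN: mat_conf[1]  (observed No, predicted No)
# FN: mat_conf[2]  (observed Si, predicted No)
# FP: mat_conf[3]  (observed No, predicted Si)
# TP: mat_conf[4]  (observed Si, predicted Si)

# reported as sensitivity; mat_conf[3] holds the false positives, so this ratio
# is TP / (TP + FP); TP / (TP + FN) would be mat_conf[4] / (mat_conf[4] + mat_conf[2])
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))

## 
## The sensitivity of the prediction is 0.6614.

# reported as specificity; mat_conf[2] holds the false negatives, so this ratio
# is TN / (TN + FN); TN / (TN + FP) would be mat_conf[1] / (mat_conf[1] + mat_conf[3])
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))

## 
## The specificity of the prediction is 0.6338.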

desconegutsPred <- dadesDesconegutSup[, c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Es_laborable", "Es_ocupacional")]

predictions <- predict(model_nTrials, desconegutsPred, type="class", threshold=0.7)

table(predictions)

## predictions
##   No   Si 
## 1190 1006

In this model, an improvement in sensitivity is observed, with this indicator also being above specificity. Moreover, when classifying the data of drivers whose reason for travel is unknown, we observe that a majority of the results are negative (1,190 of 2,196, about 54%).

2.5 Random Forest

Next, we also build a classification model using the Random Forest algorithm, which grows 500 trees on the training dataset:

library(rpart)

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:dials':
## 
##     prune

dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Es_laborable", "Es_ocupacional")

y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides
x <- dummy[colsSup]
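# Note: randomForest() takes class cut-offs through its 'cutoff' argument;
# 'threshold' is not one of its documented arguments.
rf <- randomForest(y ~., data = x, threshold=0.7); rf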

## 
## Call:
##  randomForest(formula = y ~ ., data = x, threshold = 0.7) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 37.27%
## Confusion matrix:
##     No  Si class.error
## No 962 545   0.3616457
## Si 507 809   0.3852584

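mat_conf <- rf$confusion

# rf$confusion has rows = observed and columns = predicted (plus class.error),
# so mat_conf[2] holds the false negatives and mat_conf[3] the false positives

# reported as sensitivity; as indexed, this ratio is TP / (TP + FP)
sensitivitat <- mat_conf[4] / (mat_conf[4] + mat_conf[3])
print(glue("\n\nThe sensitivity of the prediction is {round(sensitivitat, 4)}."))

## 
## The sensitivity of the prediction is 0.5975.

# reported as specificity; as indexed, this ratio is TN / (TN + FN)
especificitat <- mat_conf[1] / (mat_conf[1] + mat_conf[2])
print(glue("\n\nThe specificity of the prediction is {round(especificitat, 4)}."))

## 
## The specificity of the prediction is 0.6549.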

predictions.rf <- predict(rf, desconegutsPred, type="class")

table(predictions.rf)

## predictions.rf
##   No   Si 
## 1064 1132

The performance of this model is very similar to that of the previous one, the decision tree parameterized with 99 iterations, although the resulting classification of the drivers whose reason for travel is unknown is clearly different (1,064 No and 1,132 Si here, against 1,190 No and 1,006 Si).
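The results table lists 0.7 as the acceptance threshold for the positive class. Whether a `threshold = 0.7` argument is honoured depends on each prediction function, so one hedged alternative is to request class probabilities from the model and apply the cut-off explicitly. The helper `apply_threshold()` and the probabilities below are hypothetical and only illustrate the rule.

```r
# Hypothetical helper: assign "Si" only when P(Si) reaches the threshold
apply_threshold <- function(prob_si, threshold = 0.7) {
  factor(ifelse(prob_si >= threshold, "Si", "No"), levels = c("No", "Si"))
}

# Hypothetical predicted probabilities of the positive class
prob_si <- c(0.95, 0.72, 0.69, 0.50, 0.10)

strict  <- table(apply_threshold(prob_si))        # threshold 0.7
default <- table(apply_threshold(prob_si, 0.5))   # threshold 0.5
print(strict)    # No: 3, Si: 2
print(default)   # No: 1, Si: 4
```

With the real models, `prob_si` would come from a probability-type prediction (for example `type = "prob"` where the model supports it), and comparing the two tables shows how much the chosen threshold shifts the class balance.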

2.6 Pruned Decision Tree

Finally, we also try the pruned tree method implemented in the second part. Here we build the model with the tree() function and prune it with prune.tree(), both from the tree package, so that it keeps the three terminal nodes that yield the best result.

dummy[] <- lapply(dummy, factor)
colsEscollides <- c("Tipus_vehicle_estandaritzat",
                       "Descripcio_causa_mediata",
                       "Nom_mes",
                       "Greus_CODIF",
                       "Victimes_CODIF",
                       "Vehicles_CODIF",
                       "Edat_CODIF", "Es_laborable", "Es_ocupacional",
                    "Es_mon_treball")

y <- dummy$Es_mon_treball
colsSup <- colnames(dummy) %in% colsEscollides

dummy_ <- dummy[colsSup]

split <- createDataPartition(y=dummy_$Es_mon_treball, p=4/5, list=FALSE)

train <- dummy_[split,]
test <- dummy_[-split,]

trees <- tree(Es_mon_treball~., train)

prune.trees <- prune.tree(trees, best=3)

tree.pred <- predict(prune.trees, test, type='class', threshold=0.7)
confusionMatrix(tree.pred, test$Es_mon_treball, positive = "Si")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No  Si
##         No 197 100
##         Si 104 163
##                                          
##                Accuracy : 0.6383         
##                  95% CI : (0.5971, 0.678)
##     No Information Rate : 0.5337         
##     P-Value [Acc > NIR] : 3.21e-07       
##                                          
##                   Kappa : 0.274          
##                                          
##  Mcnemar's Test P-Value : 0.8336         
##                                          
##             Sensitivity : 0.6198         
##             Specificity : 0.6545         
##          Pos Pred Value : 0.6105         
##          Neg Pred Value : 0.6633         
##              Prevalence : 0.4663         
##          Detection Rate : 0.2890         
##    Detection Prevalence : 0.4734         
##       Balanced Accuracy : 0.6371         
##                                          
##        'Positive' Class : Si             
##

prediction.prune.tree <- predict(prune.trees, desconegutsPred, type='class',
                                 threshold=0.7)
table(prediction.prune.tree)

## prediction.prune.tree
##   No   Si 
## 1043 1153

The results of the pruned tree model are similar to those of the unpruned decision tree, and the classification of the drivers whose reason for travel is unknown is in fact identical (1,043 No and 1,153 Si).

3 Model selection

First, we present the table with the most relevant results of this third part, and then justify the choice of the XGBoost model.

3.1 Results Table

Only the accuracy indicators of the logistic regression, XGBoost, and pruned tree models will be detailed since in all three cases the resulting model is robust enough, that is, it does not show appreciable variability in its results.

Index	Results Description	Data Type	Results
#1	Total number of drivers	Integer	5019
#2	Subtotal of drivers whose reason for travel is known, proportion of the total	Integer, percent	2823, 56.246%
#3	Number of accidents involving drivers whose reason for travel is known	Integer	2561
#4	Subtotal of drivers whose reason for travel is unknown, proportion of the total	Integer, percent	2196, 43.754%
#5	Number of accidents involving drivers whose reason for travel is unknown	Integer	2029
#6	Drivers injured during occupational hours, proportion of total drivers whose reason for travel is known	Integer, percent	1272, 45.058%
#7	Drivers injured during non-occupational hours, proportion of total drivers whose reason for travel is known	Integer, percent	1551, 54.942%
#8	Drivers injured during occupational hours, proportion of total drivers whose reason for travel is unknown	Integer, percent	1150, 52.368 %
#9	Drivers injured during non-occupational hours, proportion of total drivers whose reason for travel is unknown	Integer, percent	1046, 47.632 %
#10	Number of independent variables assessed for inclusion in the classification model	Integer	29
#11	Number of independent variables accepted for inclusion in subsequent classification models	Integer	11
#12	Acceptance threshold for the positive class result	Float	0.7
#13	In the Logistic Regression model, Sensitivity was higher than Specificity	Boolean	False
#14	Accuracy, Sensitivity, and Specificity of the Logistic Regression model	Float, float, float	0.6645, 0.6468, 0.6788
#14	In the XGBoost model, Sensitivity was higher than Specificity	Boolean	True
#15	Accuracy, Sensitivity, and Specificity of the XGBoost model	Float, float, float	0.6731, 0.6987, 0.6439
#16	In the unparameterized Decision Tree model, Sensitivity was higher than Specificity	Boolean	False
#17	In the parameterized Decision Tree model set to run 99 cycles, Sensitivity was higher than Specificity	Boolean	False
#18	In the Random Forest model, Sensitivity was higher than Specificity	Boolean	False
#19	In the Pruned Tree model, Sensitivity was higher than Specificity	Boolean	False
#20	Accuracy, Sensitivity, and Specificity of the Pruned Tree model	Float, float, float	0.6525, 0.6084, 0.6910
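
The percentages in the first rows of the table can be rechecked from the counts. The sketch below uses the same rounding to three decimals as the code in the introduction. The helper `pct` and the variable names are ours, introduced only for illustration:

```r
# Counts taken from the results table
n_total   <- 5019  # total number of drivers
n_known   <- 2823  # reason for travel known
n_unknown <- 2196  # reason for travel unknown

# The two subtotals must add up to the total
stopifnot(n_known + n_unknown == n_total)

# Percentage of 'part' within 'whole', rounded to three decimals
pct <- function(part, whole) round(part / whole * 100, 3)

shares <- c(
  known_of_total       = pct(n_known, n_total),
  unknown_of_total     = pct(n_unknown, n_total),
  occup_of_known       = pct(1272, n_known),
  non_occup_of_known   = pct(1551, n_known),
  occup_of_unknown     = pct(1150, n_unknown),
  non_occup_of_unknown = pct(1046, n_unknown)
)
print(shares)
```

Each pair of complementary shares should add up to 100, which is a quick consistency check on the counts.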

3.2 Rationale

In this third part of our task of classifying drivers injured in traffic accidents during 2023, we have focused on putting into practice an imputation for the 2,196 drivers whose reason for travel was unknown. The relevance of this imputation lies in the fact that these drivers, for whom we otherwise had a complete set of valid data, accounted for almost 44% of all injured drivers (see row #4 in the Results Table). Such a high proportion made the reason-for-travel variable unreliable for analysis. With the work done so far, we consider that it can now be used as one more variable to enrich the study of trends in traffic accidents with injured drivers in which officers of the Barcelona Guàrdia Urbana intervened.

It is also important to assess the variables accepted for their statistical significance in explaining the outcome, which here is a “Yes”/“No” value for each driver indicating whether the reason for travel at the time of the documented accident was related to the world of work (see the first part of this study for details on the significance of this variable). These variables matter not only because the corresponding algorithm assesses them as statistically optimal, but also because they are consistent with what we know of the world of work. It is therefore coherent that the type of vehicle, especially professional-use vehicles, and the fact that the accident took place on a working day and during occupational hours are relevant for considering the travel work-related. More intuitively, it also makes sense that age, and the month or season in which the accident took place, have clear analytical relevance. Along the same lines, and with a view to later interpretation, it is interesting that the number of serious injuries, whether more than two victims or vehicles were involved, and the immediate cause of the accident also show statistically relevant relationships with a work-related reason for travel, although these relationships should not necessarily be understood as causal. It is equally of interest that neither biological sex nor whether any of the involved drivers had less than 5 years of driving experience proved statistically significant when classifying with the logistic regression algorithm. The district where the accident took place was also discarded, which suggests that the administrative divisions of the territory are not significant in explaining whether the driver was traveling for a work-related reason.

The classification produced by the XGBoost model finally chosen is robust because it does not derive from a single split of the data into a training set, nor from a randomly chosen parameterization. It was chosen because it explains a greater proportion of the outcomes of the dependent variable ‘Es_mon_treball’ according to the indicator known as area under the curve (AUC), after 500 different tests, each seeking to improve on the previous result, with weighting applied to the smaller class groups, which reduces the bias already detected in the dataset in the second part. A further feature that adds reliability is that it is the only classification model which, when the proposed classification is verified, shows a greater proportion of correct predictions for the positive outcome (Si, the reason for travel is related to the world of work) than for the negative outcome.
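
To make the two ideas in this paragraph concrete, the sketch below computes an AUC from ranked scores and derives weights that balance the classes. It is an illustration in base R under our own assumptions, not the XGBoost procedure actually used. The labels, scores, and the function `auc_rank` are hypothetical:

```r
# Hypothetical labels and predicted scores for six drivers
labels <- c("Si", "No", "Si", "No", "No", "No")
scores <- c(0.9, 0.8, 0.6, 0.5, 0.4, 0.2)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case scores higher than a negative one
auc_rank <- function(labels, scores, positive = "Si") {
  pos   <- labels == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks are used for tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(labels, scores)

# Weight the smaller class so both classes carry the same total weight
w_pos <- sum(labels == "No") / sum(labels == "Si")
w <- ifelse(labels == "Si", w_pos, 1)

print(auc)
print(w)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, so it compares models independently of any single acceptance threshold.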

head(dadesDesconegut__)

##   Numero_expedient Descripcio_sexe Edat Descripció_tipus_persona
## 1  2023S000006                Home   49                Conductor
## 2  2023S000007                Home   27                Conductor
## 3  2023S000015                Home   51                Conductor
## 4  2023S000028                Home   49                Conductor
## 5  2023S000033                Home   42                Conductor
## 6  2023S000038                Dona   43                Conductor
##                                                                                                                    Descripcio_Lloc_atropellament_vianant
## 1 Desconegut                                                                                                                                            
## 2 Desconegut                                                                                                                                            
## 3 Desconegut                                                                                                                                            
## 4 Desconegut                                                                                                                                            
## 5 Desconegut                                                                                                                                            
## 6 Desconegut                                                                                                                                            
##   Descripcio_Motiu_desplacament_conductor Desc_Tipus_vehicle_implicat
## 1                            Es desconeix                 Motocicleta
## 2                            Es desconeix                 Motocicleta
## 3                            Es desconeix                   Furgoneta
## 4                            Es desconeix                 Motocicleta
## 5                            Es desconeix                 Motocicleta
## 6                            Es desconeix                  Ciclomotor
##            Tipus_vehicle_estandaritzat
## 1      Vehicles motoritzats de 2 rodes
## 2      Vehicles motoritzats de 2 rodes
## 3 Vehicles motoritzats de quatre rodes
## 4      Vehicles motoritzats de 2 rodes
## 5      Vehicles motoritzats de 2 rodes
## 6      Vehicles motoritzats de 2 rodes
##                                   Descripcio_victimitzacio Victimitzacio_est
## 1 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
## 2                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 3                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 4                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 5                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 6 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
##   Codi_districte  Nom_districte Codi_barri                  Nom_barri
## 1              6         Gracia         31          la Vila de Gràcia
## 2              3 Sants-Montjuic         12 la Marina del Prat Vermell
## 3             10     Sant Marti         68                el Poblenou
## 4              2       Eixample          5              el Fort Pienc
## 5              2       Eixample          7     la Dreta de l'Eixample
## 6              2       Eixample          6         la Sagrada Família
##   Codi_carrer                                         Nom_carrer Num_postal.
## 1      206403 Gran de Gràcia / Gràcia                              0072 0072
## 2         180 Ramon Albó                                           0093 0093
## 3      115603 Espronceda / Pallars                                 0121 0121
## 4      225500 Nàpols                                              0061X0081X
## 5       28305 Bailèn / Ausiàs Marc                                 0043 0043
## 6      191204 Padilla / Mallorca                                  0439B0439B
##   Descripcio_dia_setmana NK_Any Mes_any Nom_mes Dia_mes Hora_dia
## 1               Diumenge   2023       1   Gener       1       17
## 2               Diumenge   2023       1   Gener       1       18
## 3                Dilluns   2023       1   Gener       2       10
## 4                Dilluns   2023       1   Gener       2       21
## 5                Dimarts   2023       1   Gener       3       13
## 6                Dimarts   2023       1   Gener       3       14
##   Descripcio_torn Numero_morts Numero_lesionats_lleus Numero_lesionats_greus
## 1           Tarda            0                      1                      0
## 2           Tarda            0                      1                      0
## 3            Matí            0                      3                      0
## 4           Tarda            0                      1                      0
## 5            Matí            0                      1                      0
## 6           Tarda            0                      1                      0
##   Numero_victimes Numero_vehicles_implicats
## 1               1                         2
## 2               1                         2
## 3               3                         2
## 4               1                         2
## 5               1                         2
## 6               1                         2
##                  Descripcio_causa_mediata Antiguitat_carnet_min
## 1        Avancament defectuos/improcedent                     7
## 2 Manca precaucio incorporacio circulacio                     7
## 3                        Desobeir semafor                    16
## 4         Canvi de carril sense precaucio                    16
## 5           Gir indegut o sense precaucio                    24
## 6                        Desobeir semafor                     1
##   Vehicles motoritzats de 2 rodes implicats
## 1                                         1
## 2                                         2
## 3                                         0
## 4                                         1
## 5                                         2
## 6                                         0
##   Vehicles motoritzats de quatre rodes implicats
## 1                                              1
## 2                                              0
## 3                                              0
## 4                                              0
## 5                                              0
## 6                                              0
##   Vehicles sense permís de conducció implicats
## 1                                            0
## 2                                            0
## 3                                            2
## 4                                            1
## 5                                            0
## 6                                            1
##   Vehicles d'Ús Professional implicats Es_laborable Es_mon_treball
## 1                                    0           No             No
## 2                                    0           No             No
## 3                                    0           Si             Si
## 4                                    0           Si             No
## 5                                    0           Si             Si
## 6                                    0           Si             Si
##   Es_atropellament                           Edat_CODIF
## 1               No Conductors d'entre 38 i 49 anys edat
## 2               No         Conductors fins 29 anys edat
## 3               No    Conductors majors de 49 anys edat
## 4               No Conductors d'entre 38 i 49 anys edat
## 5               No Conductors d'entre 38 i 49 anys edat
## 6               No Conductors d'entre 38 i 49 anys edat
##   Interve_conductor_novell Lleus_CODIF Greus_CODIF Morts_CODIF Victimes_CODIF
## 1                       No          Si          No          No             No
## 2                       No          Si          No          No             No
## 3                       No          Si          No          No             Si
## 4                       No          Si          No          No             No
## 5                       No          Si          No          No             No
## 6                       Si          Si          No          No             No
##   Vehicles_CODIF VM2R_CODIF VM4R_CODIF V_no_permis_CODIF VUP_CODIF
## 1             Si         Si         Si                No        No
## 2             Si         Si         No                No        No
## 3             Si         No         No                Si        No
## 4             Si         Si         No                Si        No
## 5             Si         Si         No                No        No
## 6             Si         No         No                Si        No
##   Es_ocupacional                         Victimitzacio_est_ Estacio_any
## 1             No Amb assistencia sanitaria en lloc accident      Hivern
## 2             No                 Hospitalitzacio fins a 24h      Hivern
## 3             Si                 Hospitalitzacio fins a 24h      Hivern
## 4             No                 Hospitalitzacio fins a 24h      Hivern
## 5             Si                 Hospitalitzacio fins a 24h      Hivern
## 6             Si Amb assistencia sanitaria en lloc accident      Hivern
##   Es_classificacio_ML
## 1                  Si
## 2                  Si
## 3                  Si
## 4                  Si
## 5                  Si
## 6                  Si

head(final)

##   Numero_expedient Es_classificacio_ML Es_mon_treball
## 1  2023S000006                      Si             No
## 2  2023S000007                      Si             No
## 3  2023S000008                      No             Si
## 4  2023S000010                      No             No
## 5  2023S000011                      No             No
## 6  2023S000011                      No             No
##   Descripcio_Motiu_desplacament_conductor Descripcio_sexe Edat
## 1                            Es desconeix            Home   49
## 2                            Es desconeix            Home   27
## 3                               En missió            Home   35
## 4                       Altres activitats            Home   42
## 5                       Altres activitats            Home   31
## 6                       Altres activitats            Dona   64
##                             Edat_CODIF Descripció_tipus_persona
## 1 Conductors d'entre 38 i 49 anys edat                Conductor
## 2         Conductors fins 29 anys edat                Conductor
## 3   Conductors entre 30 i 37 anys edat                Conductor
## 4 Conductors d'entre 38 i 49 anys edat                Conductor
## 5   Conductors entre 30 i 37 anys edat                Conductor
## 6    Conductors majors de 49 anys edat                Conductor
##   Desc_Tipus_vehicle_implicat     Tipus_vehicle_estandaritzat
## 1                 Motocicleta Vehicles motoritzats de 2 rodes
## 2                 Motocicleta Vehicles motoritzats de 2 rodes
## 3                 Motocicleta Vehicles motoritzats de 2 rodes
## 4                 Motocicleta Vehicles motoritzats de 2 rodes
## 5                 Motocicleta Vehicles motoritzats de 2 rodes
## 6                 Motocicleta Vehicles motoritzats de 2 rodes
##                                   Descripcio_victimitzacio Victimitzacio_est
## 1 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
## 2                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 3 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
## 4 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
## 5                   Ferit lleu: Hospitalització fins a 24h        Ferit lleu
## 6 Ferit lleu: Amb assistència sanitària en lloc d'accident        Ferit lleu
##                           Victimitzacio_est_ NK_Any Mes_any Nom_mes Estacio_any
## 1 Amb assistencia sanitaria en lloc accident   2023       1   Gener      Hivern
## 2                 Hospitalitzacio fins a 24h   2023       1   Gener      Hivern
## 3 Amb assistencia sanitaria en lloc accident   2023       1   Gener      Hivern
## 4 Amb assistencia sanitaria en lloc accident   2023       1   Gener      Hivern
## 5                 Hospitalitzacio fins a 24h   2023       1   Gener      Hivern
## 6 Amb assistencia sanitaria en lloc accident   2023       1   Gener      Hivern
##   Dia_mes Descripcio_dia_setmana Hora_dia Descripcio_torn Es_laborable
## 1       1               Diumenge       17           Tarda           No
## 2       1               Diumenge       18           Tarda           No
## 3       1               Diumenge       14           Tarda           No
## 4       2                Dilluns        8            Matí           Si
## 5       2                Dilluns        9            Matí           Si
## 6       2                Dilluns        9            Matí           Si
##   Es_ocupacional Codi_districte       Nom_districte Codi_barri
## 1             No              6              Gracia         31
## 2             No              3      Sants-Montjuic         12
## 3             No              6              Gracia         31
## 4             Si              5 Sarria-Sant Gervasi         26
## 5             Si              7      Horta-Guinardo         39
## 6             Si              7      Horta-Guinardo         39
##                    Nom_barri Codi_carrer
## 1          la Vila de Gràcia      206403
## 2 la Marina del Prat Vermell         180
## 3          la Vila de Gràcia      267000
## 4     Sant Gervasi - Galvany      194803
## 5   Sant Genís dels Agudells      295108
## 6   Sant Genís dels Agudells      295108
##                                           Nom_carrer Num_postal.
## 1 Gran de Gràcia / Gràcia                              0072 0072
## 2 Ramon Albó                                           0093 0093
## 3 Riera de Cassoles                                    0056 0058
## 4 Marc Aureli                                          0031 0033
## 5 Sant Cugat / Samaria                                0002U0002U
## 6 Sant Cugat / Samaria                                0002U0002U
##   Numero_victimes Victimes_CODIF Numero_morts Morts_CODIF
## 1               1             No            0          No
## 2               1             No            0          No
## 3               1             No            0          No
## 4               1             No            0          No
## 5               3             Si            0          No
## 6               3             Si            0          No
##   Numero_lesionats_greus Greus_CODIF Numero_lesionats_lleus Lleus_CODIF
## 1                      0          No                      1          Si
## 2                      0          No                      1          Si
## 3                      0          No                      1          Si
## 4                      0          No                      1          Si
## 5                      0          No                      3          Si
## 6                      0          No                      3          Si
##   Numero_vehicles_implicats Vehicles_CODIF Vehicles d'Ús Professional implicats
## 1                         2             Si                                    0
## 2                         2             Si                                    0
## 3                         2             Si                                    0
## 4                         2             Si                                    0
## 5                         3             Si                                    0
## 6                         3             Si                                    0
##   VUP_CODIF Vehicles motoritzats de 2 rodes implicats VM2R_CODIF
## 1        No                                         1         Si
## 2        No                                         2         Si
## 3        No                                         2         Si
## 4        No                                         2         Si
## 5        No                                         1         Si
## 6        No                                         1         Si
##   Vehicles motoritzats de quatre rodes implicats VM4R_CODIF
## 1                                              1         Si
## 2                                              0         No
## 3                                              0         No
## 4                                              0         No
## 5                                              2         Si
## 6                                              2         Si
##   Vehicles sense permís de conducció implicats V_no_permis_CODIF
## 1                                            0                No
## 2                                            0                No
## 3                                            0                No
## 4                                            0                No
## 5                                            0                No
## 6                                            0                No
##                  Descripcio_causa_mediata Antiguitat_carnet_min
## 1        Avancament defectuos/improcedent                     7
## 2 Manca precaucio incorporacio circulacio                     7
## 3            Manca atencio a la conduccio                     4
## 4 Manca precaucio incorporacio circulacio                    13
## 5           Gir indegut o sense precaucio                     2
## 6           Gir indegut o sense precaucio                     2
##   Interve_conductor_novell
## 1                       No
## 2                       No
## 3                       Si
## 4                       No
## 5                       Si
## 6                       Si
##                                                                                                                    Descripcio_Lloc_atropellament_vianant
## 1 Desconegut                                                                                                                                            
## 2 Desconegut                                                                                                                                            
## 3 A la vorera / Andana                                                                                                                                  
## 4 Desconegut                                                                                                                                            
## 5 Desconegut                                                                                                                                            
## 6 Desconegut                                                                                                                                            
##   Es_atropellament Longitud_WGS84 Latitud_WGS84
## 1               No       2.155459      41.39984
## 2               No       2.177056      41.42689
## 3               Si       2.150179      41.40522
## 4               No       2.142298      41.40195
## 5               No       2.138388      41.42207
## 6               No       2.138388      41.42207
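
With the WGS84 coordinates attached, the table can be exported in a plain format that geospatial software reads as a point layer. The sketch below is a minimal example under our own assumptions: `final_demo` is a three-row stand-in built from the first rows shown above, and the bounding box is an approximate range for Barcelona used only as a sanity check:

```r
# Small stand-in for 'final', using values from the first rows shown above
final_demo <- data.frame(
  Numero_expedient    = c("2023S000006", "2023S000007", "2023S000008"),
  Es_classificacio_ML = c("Si", "Si", "No"),
  Longitud_WGS84      = c(2.155459, 2.177056, 2.150179),
  Latitud_WGS84       = c(41.39984, 41.42689, 41.40522)
)

# Sanity check: coordinates should fall inside an approximate bounding
# box for Barcelona (assumed limits, not taken from the source)
in_bbox <- with(final_demo,
                Longitud_WGS84 > 2.05 & Longitud_WGS84 < 2.23 &
                Latitud_WGS84 > 41.32 & Latitud_WGS84 < 41.47)
stopifnot(all(in_bbox))

# Export as UTF-8 CSV; longitude and latitude columns can then be mapped
# to X and Y when importing the file as delimited text
out_file <- tempfile(fileext = ".csv")
write.csv(final_demo, out_file, row.names = FALSE, fileEncoding = "UTF-8")
```

Writing UTF-8 preserves the accented characters in the Catalan column values when the file is opened elsewhere.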

4 Bibliography

BRUCE, P.; BRUCE, A.; and GEDECK, P. 2022. Estadística práctica para ciencia de datos con R y Python, Barcelona: Marcombo.
INSTITUT METRÒPOLI. 2023. Enquesta de mobilitat en dia feiner 2023. (EMEF 2023), Barcelona: Autoritat del Transport Metropolità. URL (last visited August 13th 2024).

Machine Learning-Based Classification of Drivers Injured in Traffic Accidents in Barcelona According to Their Reason for Travel (2023)

Joan Manel Ramírez Jávega

September 5 2024

Contents