Se importan las librerías a usar:

library(dplyr)
library(magrittr)
library(dplyr)
library(knitr)
library(leaflet)
library(cluster)
library(clustertend)
library(NbClust)
library(factoextra)
library(fpc)
library(clValid)
library(kohonen)

Se leen los datos con los que se trabajará. Es importante mencionar que estos datos previamente pasaron por un proceso de preprocesamiento en el cual se hizo la respectiva gestión de datos faltantes, datos atípicos y corrección de datos como el Barrio o la Comuna donde se presentó el accidente:

datos <- read.csv("C:/RDATA/datos_final_corregido.csv")

1. Definición y creación de variables para el agrupamiento

1.1. Accidentes por barrio

Se totalizó y se agrupó la cantidad de accidentes en cada uno de los barrios de Medellín, esto con el fin de realizar un análisis de cuáles barrios presentan mayor o menor accidentalidad.

df_acc <- data.frame(aggregate(CANT.~BARRIO, data = datos, FUN = sum))

Escalamos los datos para el agrupamiento.

df_acc$CANT_sc <- scale(df_acc$CANT.)

1.2. Muertos en accidentes por barrio

Se obtuvo, para cada barrio de Medellín, la proporción de accidentes que involucraron muertos sobre el total de accidentes en aquel barrio entre los años 2014-2018.

datos$MUERTOS <- ifelse(datos$GRAVEDAD == 'Muerto', 1, 0)
df_mue <- data.frame(aggregate(MUERTOS~BARRIO, data = datos, FUN = mean))

Escalamos los datos para el agrupamiento.

df_mue$MUERTOS_sc <- scale(df_mue$MUERTOS)

1.3. Heridos en accidentes por barrio

Se obtuvo, para cada barrio de Medellín, la proporción de accidentes que involucraron heridos sobre el total de accidentes en aquel barrio entre los años 2014-2018.

datos$HERIDOS <- ifelse(datos$GRAVEDAD == 'Herido',1,0)
df_her <- data.frame(aggregate(HERIDOS~BARRIO, data = datos, FUN=mean))

Escalamos los datos para el agrupamiento.

df_her$HERIDOS_sc <- scale(df_her$HERIDOS)

1.4. Atropellos en accidentes por barrio

Se obtuvo, para cada barrio de Medellín, la proporción de accidentes que involucraron atropellos sobre el total de accidentes en aquel barrio entre los años 2014-2018.

datos$ATROPELLOS <- ifelse(datos$CLASE == "Atropello",1,ifelse(datos$CLASE == "Choque y Atropello",1,0))
df_atr <- data.frame(aggregate(ATROPELLOS~BARRIO, data = datos, FUN = mean))

Escalamos los datos para el agrupamiento.

df_atr$ATR_sc <- scale(df_atr$ATROPELLOS)

2. Agrupamiento

2.1. Estadísticos y métricas para el agrupamiento

2.1.1. Hopkins

El estadístico de Hopkins nos permite identificar si un conjunto de datos es óptimo para hacer agrupamiento calculando la probabilidad de que los datos que se pretenden agrupar provengan de una distribución uniforme.

En el estadístico de Hopkins buscamos valor cercanos a cero, esto indicaría que los datos tienen poca probabilidad de tener una distribución uniforme y que así sea de utilidad realizar el agrupamiento. Por el contrario, valores cercanos a 0.5 indicarían que los datos se distribuyen uniformemente y por lo tanto no tendría sentido realizar el agrupamiento.

Se crean unos dataframes con los datos escalados de cada variable que luego serán usados para calcular el estadístico de Hopkins.

df_acc_h <- data.frame(df_acc$CANT_sc)
df_mue_h <- data.frame(df_mue$MUERTOS_sc)
df_her_h <- data.frame(df_her$HERIDOS_sc)
df_atr_h <- data.frame(df_atr$ATR_sc)

set.seed(123)

hopkins(df_acc_h, n=nrow(df_acc_h)-1)

## $H
## [1] 0.2066747

hopkins(df_acc_h, n=nrow(df_mue_h)-1)

## $H
## [1] 0.163892

hopkins(df_acc_h, n=nrow(df_her_h)-1)

## $H
## [1] 0.2094241

hopkins(df_acc_h, n=nrow(df_atr_h)-1)

## $H
## [1] 0.2257641

Los estadísticos de Hopkins resultantes para cada una de las variables a usar son menores a 0.5, incluso están más cerca de cero, por lo tanto se puede inferir que los datos son susceptibles de agrupamiento.

2.1.2. Número óptimo de agrupamientos (clusters)

2.1.2.1. El método del codo (Elbow method)

El método del codo, o Elbow method, permite hallar cuales son los valores óptimos de clusters para un conjunto de datos. Este método prueba diferente hiperparámetros de número de clusters, calcula las varianza al interior de los grupos y sugiere que el valor adecuado es aquel a partir del cuál se dejan de evidenciar mejoras sustanciales en el agrupamiento.

Aplicamos el método del codo a nuestros datos.

fviz_nbclust(df_acc_h, kmeans, method = "wss")+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Se evidencia que a partir de k=2 o k=3 no hay mejoras sustanciales en el agrupamiento.

fviz_nbclust(df_mue_h, kmeans, method = "wss")+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Se evidencia que a partir de k=2 no hay mejoras sustanciales en el agrupamiento.

fviz_nbclust(df_her_h, kmeans, method = "wss")+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Se evidencia que a partir de k=2 y k=3 no hay mejoras sustanciales en el agrupamiento.

fviz_nbclust(df_atr_h, kmeans, method = "wss")+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Se evidencia que a partir de k=3 no hay mejoras sustanciales en el agrupamiento.

Marcamos gráficamente el punto óptimo para cada agrupamiento.

Accidentes:

fviz_nbclust(df_acc_h, kmeans, method = "wss")+ geom_vline(xintercept=2, linetype=2)+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Muertos:

fviz_nbclust(df_mue_h, kmeans, method = "wss")+ geom_vline(xintercept=2, linetype=2)+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Heridos:

fviz_nbclust(df_her_h, kmeans, method = "wss")+ geom_vline(xintercept=2, linetype=2)+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

Atropellos:

fviz_nbclust(df_atr_h, kmeans, method = "wss")+ geom_vline(xintercept=3, linetype=2)+ labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Elbow\"")

2.1.2.2. Ensambles

También se hace uso de un método de ensamble para hallar el número óptimo de clusters a partir de 26 métricas utilizando la distancia euclideana para calcular la matriz de disimilaridad.

Accidentes:

fviz_nbclust(NbClust(df_acc_h, distance = "euclidean",min.nc = 2, max.nc = 10, method = "kmeans", index ="all"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 1 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  6 
##  
##  
## ******************************************************************* 
## Among all indices: 
## ===================
## * 1 proposed  -Inf as the best number of clusters
## * 1 proposed  -6.8197 as the best number of clusters
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  0.0112 as the best number of clusters
## * 1 proposed  0.0383 as the best number of clusters
## * 1 proposed  0.0428 as the best number of clusters
## * 1 proposed  0.2305 as the best number of clusters
## * 1 proposed  0.3373 as the best number of clusters
## * 1 proposed  0.4999 as the best number of clusters
## * 1 proposed  0.546 as the best number of clusters
## * 1 proposed  0.6525 as the best number of clusters
## * 1 proposed  0.7702 as the best number of clusters
## * 1 proposed  0.814 as the best number of clusters
## * 1 proposed  2 as the best number of clusters
## * 1 proposed  2.5476 as the best number of clusters
## * 1 proposed  3 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  9.4423 as the best number of clusters
## * 1 proposed  10.0632 as the best number of clusters
## * 1 proposed  23.4569 as the best number of clusters
## * 1 proposed  64.4946 as the best number of clusters
## * 1 proposed  552.0397 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  0 .

Muertos:

fviz_nbclust(NbClust(df_mue_h, distance = "euclidean",min.nc = 2, max.nc = 10, method = "kmeans", index ="all"))

## [1] "Frey index : No clustering structure in this data set"

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## ******************************************************************* 
## Among all indices: 
## ===================
## * 1 proposed  -Inf as the best number of clusters
## * 1 proposed  -1291.4199 as the best number of clusters
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1e-04 as the best number of clusters
## * 1 proposed  0.0041 as the best number of clusters
## * 1 proposed  0.0064 as the best number of clusters
## * 1 proposed  0.0089 as the best number of clusters
## * 1 proposed  0.3779 as the best number of clusters
## * 1 proposed  0.4679 as the best number of clusters
## * 1 proposed  0.7028 as the best number of clusters
## * 1 proposed  0.8651 as the best number of clusters
## * 1 proposed  0.9927 as the best number of clusters
## * 1 proposed  0.9972 as the best number of clusters
## * 1 proposed  1.2248 as the best number of clusters
## * 1 proposed  1.4207 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 1 proposed  348.0256 as the best number of clusters
## * 1 proposed  3058.2622 as the best number of clusters
## * 1 proposed  NA's as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  0 .

Heridos:

fviz_nbclust(NbClust(df_her_h, distance = "euclidean",min.nc = 2, max.nc = 10, method = "kmeans", index ="all"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## ******************************************************************* 
## Among all indices: 
## ===================
## * 1 proposed  -Inf as the best number of clusters
## * 1 proposed  -2.9193 as the best number of clusters
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  0.0159 as the best number of clusters
## * 1 proposed  0.0179 as the best number of clusters
## * 1 proposed  0.1363 as the best number of clusters
## * 1 proposed  0.137 as the best number of clusters
## * 1 proposed  0.3545 as the best number of clusters
## * 1 proposed  0.4606 as the best number of clusters
## * 1 proposed  0.5238 as the best number of clusters
## * 1 proposed  0.5768 as the best number of clusters
## * 1 proposed  0.5875 as the best number of clusters
## * 1 proposed  0.9487 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 1 proposed  5.0821 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  9.9576 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 1 proposed  11.662 as the best number of clusters
## * 1 proposed  24.1675 as the best number of clusters
## * 1 proposed  88.113 as the best number of clusters
## * 1 proposed  NA's as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  0 .

Atropellos:

fviz_nbclust(NbClust(df_atr_h, distance = "euclidean",min.nc = 2, max.nc = 10, method = "kmeans", index ="all"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## ******************************************************************* 
## Among all indices: 
## ===================
## * 1 proposed  -Inf as the best number of clusters
## * 1 proposed  -5.2496 as the best number of clusters
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  0.0116 as the best number of clusters
## * 1 proposed  0.0848 as the best number of clusters
## * 1 proposed  0.1624 as the best number of clusters
## * 1 proposed  0.2233 as the best number of clusters
## * 1 proposed  0.3268 as the best number of clusters
## * 1 proposed  0.4966 as the best number of clusters
## * 1 proposed  0.5021 as the best number of clusters
## * 1 proposed  0.5675 as the best number of clusters
## * 1 proposed  0.6148 as the best number of clusters
## * 1 proposed  0.6372 as the best number of clusters
## * 1 proposed  1.8989 as the best number of clusters
## * 2 proposed  3 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  5.919 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  13.6213 as the best number of clusters
## * 1 proposed  39.9372 as the best number of clusters
## * 1 proposed  86.9702 as the best number of clusters
## * 1 proposed  133.8499 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  0 .

2.1.2. Elección del mejor algoritmo de Clustering

Se evaluaron diferentes métodos de clustering con el fin de determinar cuál de estos es el óptimo para los datos que se tienen. Se evaluaron los siguientes métodos:

K-Means
PAM
CLARA
SOM

Accidentes:

intern <- clValid(df_acc_h, nClust=2:10, clMethods = c("kmeans","pam","clara", "som"),validation = "internal", maxitems = 600)
summary(intern)

## 
## Clustering Methods:
##  kmeans pam clara som 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                            2       3       4       5       6       7       8       9      10
##                                                                                             
## kmeans Connectivity   6.0452 12.1524 13.0218 21.3496 26.3028 28.7194 29.6254 35.6329 40.4504
##        Dunn           0.0112  0.0067  0.0088  0.0043  0.0042  0.0051  0.0064  0.0090  0.0099
##        Silhouette     0.7702  0.6325  0.6202  0.6100  0.6004  0.6005  0.6017  0.5955  0.5969
## pam    Connectivity   7.1710  6.0131  7.9980 16.5417 30.8599 35.7595 35.0460 39.5242 49.1615
##        Dunn           0.0018  0.0021  0.0034  0.0050  0.0009  0.0012  0.0023  0.0027  0.0020
##        Silhouette     0.5928  0.6060  0.5828  0.5995  0.5931  0.5970  0.5582  0.5576  0.5503
## clara  Connectivity   7.1710  6.8583 12.6627 18.9020 19.2623 25.1877 39.0984 47.8409 47.4163
##        Dunn           0.0018  0.0033  0.0007  0.0023  0.0027  0.0041  0.0009  0.0009  0.0029
##        Silhouette     0.5928  0.5989  0.5557  0.5949  0.5899  0.5781  0.5380  0.5339  0.5253
## som    Connectivity   6.0452 12.1524 13.3909 15.0234 23.3218 31.8270 35.0996 39.0460 42.5079
##        Dunn           0.0112  0.0067  0.0023  0.0076  0.0012  0.0018  0.0023  0.0012  0.0018
##        Silhouette     0.7702  0.6325  0.6127  0.5946  0.5833  0.5948  0.5541  0.5613  0.5565
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 6.0131 pam    3       
## Dunn         0.0112 kmeans 2       
## Silhouette   0.7702 kmeans 2

optimalScores(intern)

##                  Score Method Clusters
## Connectivity 6.0130952    pam        3
## Dunn         0.0112069 kmeans        2
## Silhouette   0.7701545 kmeans        2

plot(intern)

Método recomendado: K-Means Número óptimo de clusters por mayoría: 2

Muertos:

intern <- clValid(df_mue_h, nClust=2:10, clMethods = c("kmeans","pam","clara", "som"),validation = "internal", maxitems = 600)
summary(intern)

## 
## Clustering Methods:
##  kmeans pam clara som 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                            2       3       4       5       6       7       8       9      10
##                                                                                             
## kmeans Connectivity   4.2869  7.2159 14.0183 18.5226 20.3448 29.7175 27.8663 33.3226 38.5476
##        Dunn           9.0000  1.2500  0.0164  0.0065  0.0097  0.0072  0.0159  0.0096  0.0102
##        Silhouette     0.9927  0.9217  0.7503  0.6020  0.5947  0.6187  0.6406  0.6465  0.6425
## pam    Connectivity   4.2869  8.1734 13.5909 18.4056 24.7056 25.3813 33.3770 26.3143 42.2405
##        Dunn           9.0000  0.0011  0.0011  0.0015  0.0016  0.0062  0.0021  0.0051  0.0051
##        Silhouette     0.9927  0.5142  0.5165  0.6047  0.6211  0.6320  0.6267  0.6329  0.6395
## clara  Connectivity   4.2869  8.8766 11.7881 16.7246 31.6044 30.0607 34.2333 44.1897 43.2437
##        Dunn           9.0000  0.0006  0.0012  0.0015  0.0009  0.0011  0.0015  0.0006  0.0014
##        Silhouette     0.9927  0.4454  0.5081  0.6069  0.6173  0.6147  0.6393  0.6072  0.6162
## som    Connectivity   4.2869 10.6571 17.8476 18.5226 24.7111 22.2702 27.3512 30.4274      NA
##        Dunn           9.0000  0.0075  0.0014  0.0065  0.0018  0.0077  0.0047  0.0026      NA
##        Silhouette     0.9927  0.7986  0.6119  0.6020  0.6100  0.6373  0.6249  0.6311      NA
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 4.2869 kmeans 2       
## Dunn         9.0000 kmeans 2       
## Silhouette   0.9927 kmeans 2

plot(intern)

Método recomendado: K-Means Número óptimo de clusters por mayoría: 2

Heridos:

intern <- clValid(df_her_h, nClust=2:10, clMethods = c("kmeans","pam","clara", "som"),validation = "internal", maxitems = 600)
summary(intern)

## 
## Clustering Methods:
##  kmeans pam clara som 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                            2       3       4       5       6       7       8       9      10
##                                                                                             
## kmeans Connectivity   6.0087  8.3476 16.9028 16.9004 17.1310 24.9742 31.5103 32.2639 37.7913
##        Dunn           0.0030  0.0037  0.0090  0.0134  0.0159  0.0229  0.0175  0.0220  0.0227
##        Silhouette     0.5875  0.5418  0.5136  0.5203  0.5582  0.5417  0.5290  0.5485  0.5578
## pam    Connectivity   0.8984 10.2754  8.8627 19.0869 18.4488 18.1532 22.6028 28.1651 33.8996
##        Dunn           0.0109  0.0035  0.0031  0.0021  0.0072  0.0122  0.0161  0.0131  0.0071
##        Silhouette     0.5820  0.4917  0.5024  0.4858  0.4757  0.5130  0.5500  0.5588  0.5543
## clara  Connectivity   5.8464 11.5516 12.8377 21.5135 29.3107 31.9698 38.7333 40.1290 49.3952
##        Dunn           0.0016  0.0043  0.0028  0.0038  0.0034  0.0051  0.0035  0.0067  0.0005
##        Silhouette     0.5686  0.4995  0.5006  0.4816  0.5491  0.5105  0.5275  0.5524  0.5415
## som    Connectivity   3.7131  3.9242 13.9980 17.1496 17.1310 32.3107 29.9512 38.2091 33.4421
##        Dunn           0.0057  0.0084  0.0005  0.0070  0.0159  0.0070  0.0041  0.0055  0.0094
##        Silhouette     0.5879  0.5410  0.5086  0.5269  0.5582  0.5324  0.5400  0.5533  0.5522
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 0.8984 pam    2       
## Dunn         0.0229 kmeans 7       
## Silhouette   0.5879 som    2

optimalScores(intern)

##                   Score Method Clusters
## Connectivity 0.89841270    pam        2
## Dunn         0.02289526 kmeans        7
## Silhouette   0.58785463    som        2

plot(intern)

Método recomendado: K-Means Número óptimo de clusters por mayoría: 2

Atropellos:

intern <- clValid(df_atr_h, nClust=2:10, clMethods = c("kmeans","pam","clara", "som"),validation = "internal", maxitems = 600)
summary(intern)

## 
## Clustering Methods:
##  kmeans pam clara som 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                            2       3       4       5       6       7       8       9      10
##                                                                                             
## kmeans Connectivity   3.5440  6.8647 14.1456 15.8810 15.4218 19.7929 22.8762 31.2718 39.2583
##        Dunn           0.0074  0.0071  0.0073  0.0116  0.0299  0.0353  0.0353  0.0151  0.0117
##        Silhouette     0.6148  0.6051  0.5775  0.5825  0.5836  0.5860  0.5847  0.5559  0.5524
## pam    Connectivity   5.9448  9.3968  4.6817 11.4135 18.5722 26.3857 33.3405 35.4599 46.3294
##        Dunn           0.0027  0.0021  0.0132  0.0069  0.0126  0.0055  0.0019  0.0023  0.0079
##        Silhouette     0.5930  0.5022  0.5708  0.5520  0.5680  0.5444  0.5517  0.5540  0.5376
## clara  Connectivity   5.0746 11.7107 14.1357 13.6115 30.5694 27.9135 30.0536 42.5714 47.2214
##        Dunn           0.0037  0.0017  0.0045  0.0032  0.0007  0.0020  0.0055  0.0031  0.0083
##        Silhouette     0.6036  0.5170  0.5502  0.5249  0.5164  0.5335  0.5539  0.5533  0.5450
## som    Connectivity   6.9183  9.6194 14.3571 13.2492 20.5075 34.3821 26.2111 40.7905 42.0563
##        Dunn           0.0017  0.0073  0.0020  0.0250  0.0061  0.0026  0.0026  0.0036  0.0076
##        Silhouette     0.6152  0.5941  0.5491  0.5822  0.5550  0.5435  0.5624  0.5203  0.5478
## 
## Optimal Scores:
## 
##              Score  Method Clusters
## Connectivity 3.5440 kmeans 2       
## Dunn         0.0353 kmeans 7       
## Silhouette   0.6152 som    2

optimalScores(intern)

##                   Score Method Clusters
## Connectivity 3.54404762 kmeans        2
## Dunn         0.03531397 kmeans        7
## Silhouette   0.61523844    som        2

plot(intern)

Método recomendado: K-Means Número óptimo de clusters por mayoría: 2

2.1.3. Coeficiente de silhouette

Este método es muy similar al método del codo pero con la diferencia que, en lugar de minimizar la varianza al interior de los clusters, busca maximizar el valor del coeficiente de silhouette que representa la similaridad de los individuos al interior de cada cluster, apelando a la premisa del agrupamiento que sugiere que los individuos al interior de un cluster deben ser tan similares como sea posible.

Accidentes:

fviz_nbclust(df_acc_h, kmeans, method = "silhouette")+  labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Silhouette\"")

Número óptimo de clusters: 2

Muertos:

fviz_nbclust(df_mue_h, kmeans, method = "silhouette")+  labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Silhouette\"")

Número óptimo de clusters: 2

Heridos:

fviz_nbclust(df_her_h, kmeans, method = "silhouette")+  labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Silhouette\"")

Número óptimo de clusters: 2

Atropellos:

fviz_nbclust(df_atr_h, kmeans, method = "silhouette")+  labs(title="Número óptimo de agrupaciones", Subtitle="Método \"Silhouette\"")

Número óptimo de clusters: 2

2.2. K-Means

Accidentes:

grupos_acc <- kmeans(df_acc$CANT_sc, 2)
acc_agr <- data.frame(cbind(df_acc$BARRIO, grupos_acc$cluster))
grupos_acc$centers <- order(grupos_acc$centers)
colnames(acc_agr) <- c("BARRIO", "GR_ACCIDENTES")
acc_agr <- inner_join(acc_agr, df_acc, by = "BARRIO")

Muertos:

grupos_mue <- kmeans(df_mue$MUERTOS_sc, 2)
mue_agr <- data.frame(cbind(df_mue$BARRIO, grupos_mue$cluster))
grupos_mue$centers <- order(grupos_mue$centers)
colnames(mue_agr) <- c("BARRIO", "GR_MUERTOS")
mue_agr <- inner_join(mue_agr, df_mue, by = "BARRIO")

Heridos:

grupos_her <- kmeans(df_her$HERIDOS_sc, 2)
her_agr <- data.frame(cbind(df_her$BARRIO, grupos_her$cluster))
grupos_her$centers <- order(grupos_her$centers)
colnames(her_agr) <- c("BARRIO", "GR_HERIDOS")
her_agr <- inner_join(her_agr, df_her, by = "BARRIO")

Atropellos:

grupos_atr <- kmeans(df_atr$ATR_sc, 3)
atr_agr <- data.frame(cbind(df_atr$BARRIO, grupos_atr$cluster))
grupos_atr$centers <- order(grupos_atr$centers)
colnames(atr_agr) <- c("BARRIO", "GR_ATROPELLOS")
atr_agr <- inner_join(atr_agr, df_atr, by = "BARRIO")

Se calcula la matriz de similaridad utilizando la distancia euclidiana

distancia_acc <-dist(df_acc_h, method="euclidean")
distancia_mue <-dist(df_mue_h, method="euclidean")
distancia_her <-dist(df_her_h, method="euclidean")
distancia_atr <-dist(df_atr_h, method="euclidean")

2.2.1. Resultados estadísticos del K-Means

Distancia interna (intra-cluster): siloidwidth. Distancia externa (inter-cluster): dunn

acc_statsKmeans <- cluster.stats(distancia_acc, grupos_acc$cluster)
mue_statsKmeans <- cluster.stats(distancia_mue, grupos_mue$cluster)
her_statsKmeans <- cluster.stats(distancia_her, grupos_her$cluster)
atr_statsKmeans <- cluster.stats(distancia_atr, grupos_atr$cluster)

2.2.3. Mapas

Se obtiene latitud y longitud promedio de cada barrio y luego se unen los datos de cada cluster.

barrios_map <- aggregate(cbind(LATITUD, LONGITUD)~BARRIO, data = datos, FUN = mean)
barrios_acc <- inner_join(barrios_map, acc_agr, by= "BARRIO")
barrios_mue <- inner_join(barrios_map, mue_agr, by= "BARRIO")
barrios_her <- inner_join(barrios_map, her_agr, by= "BARRIO")
barrios_atr <- inner_join(barrios_map, atr_agr, by= "BARRIO")

barrios_acc$COLOR <- ifelse(barrios_acc$GR_ACCIDENTES == grupos_acc$centers[1], "blue","red")
barrios_mue$COLOR <- ifelse(barrios_mue$GR_MUERTOS == grupos_mue$centers[1], "blue","red")
barrios_her$COLOR <- ifelse(barrios_her$GR_HERIDOS == grupos_her$centers[1], "blue","red")
barrios_atr$COLOR <- ifelse(barrios_atr$GR_ATROPELLOS == grupos_atr$centers[1], "blue",
                          ifelse(barrios_atr$GR_ATROPELLOS == grupos_atr$centers[2],"yellow", "red"))

longitud <- barrios_acc$LONGITUD
latitud <- barrios_acc$LATITUD
df <- data.frame(longitud,latitud)
leaflet(df) %>% addTiles() %>%
  addMarkers(lng = ~longitud, lat = ~latitud,
             clusterOptions = markerClusterOptions()
  )

ACCIDENTES:

color <- barrios_acc$COLOR
barrios_acc$TAMANO <- ifelse(barrios_acc$GR_ACCIDENTES == grupos_acc$centers[1], 10,20)
tamano <- barrios_acc$TAMANO
titulo <- barrios_acc$BARRIO
df <- data.frame(longitud, latitud, color, tamano, titulo)
leaflet(df) %>% addTiles() %>%
  addCircles(lng = ~longitud, lat = ~latitud, weight = 1, color=~color,
             radius = ~tamano * 20, popup = ~titulo)

En el mapa de accidentes se ve reflejado una mayor accidentalidad a lo largo del corredor central de la ciudad de Medellín, en cercanías al río Medellín. Esto quizás se debe a que el tráfico vehicular de la ciudad normalmente converge en este corredor central pues allí se encuentran las vías más rápidas y directas que conectan el extremo sur con el extremo norte de la ciudad. También se encuentran allí sectores o barrios con un gran conglomerado de fábricas, empresas y comercio en general, haciendo de este corredor un gran concentrador del tráfico vehicular de la ciudad.

MUERTOS:

color <- barrios_mue$COLOR
barrios_mue$TAMANO <- ifelse(barrios_mue$GR_MUERTOS == grupos_mue$centers[1], 10,20)
tamano <- barrios_mue$TAMANO
titulo <- barrios_mue$BARRIO
df <- data.frame(longitud, latitud, color, tamano, titulo)
leaflet(df) %>% addTiles() %>%
  addCircles(lng = ~longitud, lat = ~latitud, weight = 1, color=~color,
             radius = ~tamano * 20, popup = ~titulo)

El mapa de muertos es quizás el mapa con una menor interpretabilidad por parte del usuario. Sin embargo, cabe resaltar aquellos puntos o barrios que son demarcados como críticos en cuánto a su proporción de accidentes que involucran muertos. Barrios como San Antonio de Prado y San Javier clasifican en este grupo quizás debido a que en barrios de estratos bajos un gran porcentaje del tráfico vehicular corresponde a motociclistas que, además de estar más expuestos a accidentes por el tipo de vehículo en el que se movilizan, son actores viales que en barrios de estratos bajos la gran mayoría no usa los elementos de seguridad adecuados para su protección como el casco.

HERIDOS:

color <- barrios_her$COLOR
barrios_her$TAMANO <- ifelse(barrios_her$GR_HERIDOS == grupos_her$centers[1], 10,20)
tamano <- barrios_her$TAMANO
titulo <- barrios_her$BARRIO
df <- data.frame(longitud, latitud, color, tamano, titulo)
leaflet(df) %>% addTiles() %>%
  addCircles(lng = ~longitud, lat = ~latitud, weight = 1, color=~color,
             radius = ~tamano * 20, popup = ~titulo)

En el mapa de heridos se evidencia que aquellos barrios con mayores proporciones de heridos en sus accidentes corresponden a barrios generalmente periféricos. Al igual que en el mapa de heridos, este patrón puede verse explicado, entre otros factores, por la alta proporción de motocicletas en el total de vehículos y la poca cultura ciudadana en el uso de los elementos de seguridad como el casco.

ATROPELLOS:

color <- barrios_atr$COLOR
barrios_atr$TAMANO <- ifelse(barrios_atr$GR_ATROPELLOS == grupos_atr$centers[1], 10,20)
tamano <- barrios_atr$TAMANO
titulo <- barrios_atr$BARRIO
df <- data.frame(longitud, latitud, color, tamano, titulo)
leaflet(df) %>% addTiles() %>%
  addCircles(lng = ~longitud, lat = ~latitud, weight = 1, color=~color,
             radius = ~tamano * 20, popup = ~titulo)

En el mapa de atropellos nuevamente son los barrios periféricos aquellos datos que más información nos revelan acerca de la accidentalidad en Medellín. En este caso se puede evidenciar un aumento gradual de la proporción de atropellos en cada barrio, este aumento se da desde el corredor central de la ciudad, alrededor del río Medellín, hacia los barrios en los extremos oriental, occidental y norte de la ciudad. Un factor que puede tener incidencia en este comportamiento es el alto volumen de infracciones a las normas de tránsito que se dan en estos sectores, normas de tránsito que son de difícil regulación si se tiene en cuenta la ubicación de estos sectores y su difícil acceso.

Agrupamiento Univariable

Manuela Londono Ocampo, Alexis Arenas Bustamante, Juan Esteban Arroyave, William Jovel Tamayo

8/9/2020

1. Definición y creación de variables para el agrupamiento

1.1. Accidentes por barrio

1.2. Muertos en accidentes por barrio

1.3. Heridos en accidentes por barrio

1.4. Atropellos en accidentes por barrio

2. Agrupamiento

2.1. Estadísticos y métricas para el agrupamiento

2.1.1. Hopkins

2.1.2. Número óptimo de agrupamientos (clusters)

2.1.2.1. El método del codo (Elbow method)

2.1.2.2. Ensambles

2.1.2. Elección del mejor algoritmo de Clustering

2.1.3. Coeficiente de silhouette

2.2. K-Means

2.2.1. Resultados estadísticos del K-Means

2.2.3. Mapas