Un ejemplo de aplicación del análisis de conglomrados

Diseño de un plan de incentivos para vendedores

El director de ventas de una cadena de tiendas de electrodomésticos con implantación nacional está estudiando el plan de incentivos de sus vendedores. Considera que los incentivos deben estar ajustados a las dificultades de las distintas zonas de ventas, siendo necesario fijar incentivos más altos en aquellas zonas geográficas en que las condiciones de vida de sus habitantes hacen más difícil las ventas. Por este motivo quiere determinar si las comunidades autónomas se pueden segmentar en grupos homogéneos respecto al equipamiento de los hogares. Para ello dispone de los datos que aparecen en el cuadro 3.22, y el objetivo es establecer cuantas grupos de comunidades autonomas con niveles de equipamiento similar pueden establecerse y en que radican las diferencias entre esos grupos. el procedimiento que aplicaremos es el escrito en el sigui8ente tema a saber.

Análisis de la existencia de outliers en la medida en que pueden generar importantes distorsiones en la detección del número de grupos.

Realización de un análisis de conglomerados jerárquicos, evaluando solución de distintos métodos de conglomeración, aplicando los criterios presentados para identificar el número adecuado de grupos y obtención de los centroides que han de servir de partida para el paso siguiente.

Realización de un análisis de conglomerados no jerárquico mediante el método de k-medias para la obtención de una solución óptima en términos de homogeneidad intrasegmentos y heterogeneidad intersegmentos.

library(kableExtra)


cuadrot <- data.frame(
  CCAA = c("España","Andalucía","Aragón","Asturias","Baleares","Canarias","Cantabria",
           "Castilla y León","Cast.-La Mancha","Cataluña","Com. Valenciana","Extremadura",
           "Galicia","Madrid","Murcia","Navarra","País Vasco","La Rioja"),
  Automovil = c(69.0,66.7,67.2,63.7,71.9,72.7,63.4,65.8,61.5,70.4,72.7,60.5,65.5,74.0,69.0,76.4,71.3,64.9),
  TV_color = c(97.6,98.0,97.5,95.2,98.8,96.8,94.9,97.1,97.3,98.1,98.4,97.7,91.3,99.4,98.7,99.3,98.3,98.6),
  Video = c(62.4,82.7,56.8,52.1,62.4,68.4,48.9,47.7,53.6,71.1,68.2,43.7,42.7,76.3,59.3,60.6,61.6,54.4),
  Microondas = c(32.3,24.1,43.4,24.4,29.8,27.9,36.5,28.1,21.7,36.8,26.6,20.7,13.5,53.9,19.5,44.0,45.7,44.4),
  Lavavajillas = c(17.0,12.7,20.6,13.3,10.1,5.8,11.2,14.0,7.1,19.8,12.1,11.7,14.6,32.3,12.1,20.6,23.7,17.6),
  Telefono = c(85.2,74.7,88.4,88.1,87.9,75.4,80.5,85.0,72.9,92.2,84.4,67.1,85.9,95.7,81.4,87.4,94.3,83.4)
)

cuadrot %>%
  kbl(caption = "Equipamiento de los hogares en distintas comunidades autónomas",
      align = "lccccccc") %>%
  kable_classic(full_width = FALSE, html_font = "Times")
Equipamiento de los hogares en distintas comunidades autónomas
CCAA Automovil TV_color Video Microondas Lavavajillas Telefono
España 69.0 97.6 62.4 32.3 17.0 85.2
Andalucía 66.7 98.0 82.7 24.1 12.7 74.7
Aragón 67.2 97.5 56.8 43.4 20.6 88.4
Asturias 63.7 95.2 52.1 24.4 13.3 88.1
Baleares 71.9 98.8 62.4 29.8 10.1 87.9
Canarias 72.7 96.8 68.4 27.9 5.8 75.4
Cantabria 63.4 94.9 48.9 36.5 11.2 80.5
Castilla y León 65.8 97.1 47.7 28.1 14.0 85.0
Cast.-La Mancha 61.5 97.3 53.6 21.7 7.1 72.9
Cataluña 70.4 98.1 71.1 36.8 19.8 92.2
Com. Valenciana 72.7 98.4 68.2 26.6 12.1 84.4
Extremadura 60.5 97.7 43.7 20.7 11.7 67.1
Galicia 65.5 91.3 42.7 13.5 14.6 85.9
Madrid 74.0 99.4 76.3 53.9 32.3 95.7
Murcia 69.0 98.7 59.3 19.5 12.1 81.4
Navarra 76.4 99.3 60.6 44.0 20.6 87.4
País Vasco 71.3 98.3 61.6 45.7 23.7 94.3
La Rioja 64.9 98.6 54.4 44.4 17.6 83.4
library(kableExtra)

dato23 <- data.frame(
  CCAA = c("España","Andalucía","Aragón","Asturias","Baleares","Canarias","Cantabria",
           "Castilla y León","Castilla-La Mancha","Cataluña","Com. Valenciana",
           "Extremadura","Galicia","Madrid","Murcia","Navarra","País Vasco","La Rioja"),
  D2 = c(0.40, 3.93, 1.94, 4.46, 6.02, 10.47, 7.27, 3.25, 4.12, 4.21, 2.85,
         0.29, 13.30, 9.49, 4.61, 9.58, 2.55, 4.25),
  p_value = c(0.99, 0.68, 0.92, 9.61, 0.42, 0.10, 0.29, 0.77, 0.66, 0.64, 0.82,
              0.15, 0.03, 0.14, 0.59, 0.14, 0.86, 0.64)
)

dato23 %>%
  kbl(
    caption = "Cuadro 3.23: Resultados de la detección de outliers",
    col.names = c("CC.AA.", "D²", "p-value (χ²(df = 6))"),
    align = "lccc"
  ) %>%
  kable_classic(full_width = FALSE, html_font = "Times")
Cuadro 3.23: Resultados de la detección de outliers
CC.AA. p-value (χ²(df = 6))
España 0.40 0.99
Andalucía 3.93 0.68
Aragón 1.94 0.92
Asturias 4.46 9.61
Baleares 6.02 0.42
Canarias 10.47 0.10
Cantabria 7.27 0.29
Castilla y León 3.25 0.77
Castilla-La Mancha 4.12 0.66
Cataluña 4.21 0.64
Com. Valenciana 2.85 0.82
Extremadura 0.29 0.15
Galicia 13.30 0.03
Madrid 9.49 0.14
Murcia 4.61 0.59
Navarra 9.58 0.14
País Vasco 2.55 0.86
La Rioja 4.25 0.64

Call: hclust(d = matriz.dis.euclid.caso3, method = “average”)

Cluster method : average Distance : euclidean Number of objects: 18

 height merge.1 merge.2

1 6.874591 -3 -18 2 7.153321 -4 -8 3 7.805767 -5 -11 4 9.339233 -1 3 5 9.387225 -16 -17 6 12.380548 1 5

library(NbClust)

Datos.NbClust <- cuadrot[, 2:7] 

colnames(Datos.NbClust) <- c("automovil", "tvcolor", "video",
                             "microondas", "lavavajillas", "telefono")

Datos.NbClust[] <- lapply(Datos.NbClust, \(x) as.numeric(as.character(x)))


res.wardD2 <- NbClust(Datos.NbClust,
                      distance = "euclidean",
                      min.nc = 2,
                      max.nc = 15,
                      method = "ward.D2",
                      index = "alllong")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 8 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
print(res.wardD2)
## $All.index
##        KL      CH Hartigan     CCC    Scott      Marriot      TrCovW    TraceW
## 2  1.3084 10.9552   8.6730 13.6750 182.7994 1.458381e+15 612504.2932 3659.0000
## 3  3.0067 11.9843   3.6644  8.7681 207.2635 8.429416e+14 195076.6371 2372.7967
## 4  0.9890 10.4186   3.3738  8.3192 227.0281 4.998113e+14 162904.8130 1906.9433
## 5  0.7929  9.7875   4.0918  7.8431 248.9889 2.305508e+14 115617.9267 1536.6400
## 6  1.3688 10.2580   3.2810  7.7737 278.7225 6.364045e+13  59789.0250 1168.7690
## 7  1.2121 10.4798   2.9057  7.5738 325.9617 6.278598e+12  37773.1391  917.8223
## 8  1.3307 10.7005   2.3395  7.3293 351.2572 2.011553e+12  26359.7034  726.0390
## 9  1.0311 10.6612   2.3793  6.9147 375.8871 6.480048e+11  18848.5893  588.3850
## 10 0.9984 10.8857   2.6022  6.5456 397.9883 2.343403e+11  10040.9976  465.3583
## 11 1.0601 11.5885   2.8409  6.2961 427.8106 5.408748e+10   7201.2983  351.1417
## 12 1.2541 12.9161   2.6984  6.1704 480.8366 3.382908e+09   3896.7090  249.7733
## 13 1.5600 14.4913   1.9617  5.9377 977.1256 4.200000e-03   1829.0523  172.2883
## 14 1.0507 15.0205   2.2118  5.0933      NaN 0.000000e+00   1084.5981  123.7400
## 15 1.4032 16.3636   1.8571  4.1649      NaN 0.000000e+00    511.7484   79.6800
##         Friedman     Rubin Cindex     DB Silhouette     Duda Pseudot2   Beale
## 2   7.132820e+03  129.4288 0.3702 1.0867     0.3181   0.5504   8.1671  2.8565
## 3   8.070990e+03  199.5873 0.3956 1.0531     0.3161   4.2684  -3.0629 -2.3568
## 4   8.221908e+03  248.3451 0.6022 0.8698     0.3419   0.5515   3.2527  2.5028
## 5   1.185957e+04  308.1919 0.5579 0.8452     0.3434   8.6970  -3.5401 -2.7240
## 6   1.294684e+04  405.1956 0.5124 0.7586     0.3787  15.7979  -1.8734 -2.4025
## 7   6.141230e+04  515.9823 0.5742 0.6390     0.4212   0.4986   3.0168  2.9016
## 8   6.324076e+04  652.2791 0.6817 0.6654     0.3972  18.0014  -2.8333 -2.7252
## 9   9.074476e+04  804.8813 0.7296 0.6394     0.4261  28.6303  -0.9651 -1.8565
## 10  1.012226e+05 1017.6675 0.6985 0.5764     0.4785  24.9376  -1.9198 -2.4620
## 11  1.125365e+05 1348.6866 0.6335 0.5071     0.5212  45.5972  -0.9781 -1.8815
## 12  2.861114e+05 1896.0393 0.6906 0.4416     0.5749 132.3449   0.0000  0.0000
## 13  4.222924e+15 2748.7645 0.7908 0.3890     0.6565  60.4131  -0.9834 -1.8918
## 14 -1.880118e+15 3827.2188 0.8114 0.3620     0.6838 189.1116   0.0000  0.0000
## 15  1.049247e+16 5943.5248 0.9224 0.2904     0.7667 380.6423   0.0000  0.0000
##    Ratkowsky      Ball Ptbiserial     Gap   Frey McClain  Gamma  Gplus     Tau
## 2     0.3969 1829.5000     0.4539 -0.6438 0.4531  0.6077 0.5302 8.9542 20.2092
## 3     0.4340  790.9322     0.5012 -1.1169 0.1493  1.4282 0.6811 5.0654 21.6340
## 4     0.3967  476.7358     0.5195 -1.4087 0.4821  1.5890 0.7456 3.7582 22.0261
## 5     0.3825  307.3280     0.5029 -1.6960 0.1499  2.0181 0.7887 2.6732 19.9608
## 6     0.3590  194.7948     0.5151 -0.1846 0.2784  2.2708 0.8795 1.3399 19.5556
## 7     0.3448  131.1175     0.5091  0.4693 0.8636  2.4952 0.9154 0.8562 18.5229
## 8     0.3307   90.7549     0.4538  0.9413 0.4384  3.3721 0.9210 0.6275 14.6275
## 9     0.3171   65.3761     0.4183  1.1037 0.3168  4.2268 0.9445 0.3529 12.0131
## 10    0.3033   46.5358     0.3994  1.5455 0.2956  4.7948 0.9551 0.2484 10.5621
## 11    0.2912   31.9220     0.3639  2.1145 0.2159  6.0396 0.9645 0.1503  8.1699
## 12    0.2809   20.8144     0.3363  2.5283 0.1518  7.2496 0.9863 0.0458  6.5882
## 13    0.2713   13.2529     0.3218  3.0074 0.4252  7.9794 0.9955 0.0131  5.7386
## 14    0.2631    8.8386     0.2690  3.3806 0.2014 11.4837 0.9966 0.0065  3.8824
## 15    0.2559    5.3120     0.2394  3.8445 0.4367 14.4335 1.0000 0.0000  2.9412
##      Dunn Hubert SDindex  Dindex   SDbw
## 2  0.2865  2e-04  0.2608 13.1959 1.1107
## 3  0.3897  2e-04  0.2129 10.7652 0.7478
## 4  0.4871  3e-04  0.1881  9.4658 0.2643
## 5  0.4871  3e-04  0.2115  8.2952 0.2323
## 6  0.4978  4e-04  0.1814  7.1368 0.1605
## 7  0.5872  4e-04  0.1632  6.2177 0.1032
## 8  0.5566  4e-04  0.2029  5.5115 0.0873
## 9  0.6357  4e-04  0.2023  4.8338 0.0743
## 10 0.6418  4e-04  0.2219  4.1291 0.0581
## 11 0.6418  4e-04  0.2242  3.4737 0.0456
## 12 0.7629  4e-04  0.2223  2.8410 0.0317
## 13 0.9264  4e-04  0.2231  2.1494 0.0177
## 14 0.8976  4e-04  0.3024  1.7345 0.0150
## 15 1.0795  5e-04  0.2987  1.2130 0.0087
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2          0.3979            15.1322       0.0163        0.5234
## 3          0.1924            16.7852       1.0000        0.3500
## 4          0.1924            16.7852       0.0504        0.3758
## 5          0.1924            16.7852       1.0000       -0.7994
## 6          0.0348            55.4763       1.0000        0.3193
## 7          0.1255            20.9054       0.0369        0.6980
## 8          0.1255            20.9054       1.0000        1.1293
## 9         -0.0981           -11.1930       1.0000        1.1200
## 10         0.0348            55.4763       1.0000        1.2283
## 11        -0.0981           -11.1930       1.0000        1.5758
## 12        -0.3211             0.0000          NaN        1.8159
## 13        -0.0981           -11.1930       1.0000        2.1872
## 14        -0.3211             0.0000          NaN        2.5587
## 15        -0.3211             0.0000          NaN        2.7794
## 
## $Best.nc
##                     KL      CH Hartigan    CCC   Scott     Marriot   TrCovW
## Number_clusters 3.0000 15.0000   3.0000  2.000  13.000 3.00000e+00      3.0
## Value_Index     3.0067 16.3636   5.0086 13.675 496.289 2.72309e+14 417427.7
##                 TraceW     Friedman    Rubin Cindex      DB Silhouette   Duda
## Number_clusters   3.00 1.500000e+01   3.0000 2.0000 15.0000    15.0000 2.0000
## Value_Index     820.35 1.237259e+16 -21.4006 0.3702  0.2904     0.7667 0.5504
##                 PseudoT2   Beale Ratkowsky     Ball PtBiserial     Gap Frey
## Number_clusters   2.0000  3.0000     3.000    3.000     4.0000  2.0000    1
## Value_Index       8.1671 -2.3568     0.434 1038.568     0.5195 -0.6438   NA
##                 McClain Gamma Gplus     Tau    Dunn Hubert SDindex Dindex
## Number_clusters  2.0000    15    15  4.0000 15.0000      0  7.0000      0
## Value_Index      0.6077     1     0 22.0261  1.0795      0  0.1632      0
##                    SDbw
## Number_clusters 15.0000
## Value_Index      0.0087
## 
## $Best.partition
##  [1] 1 1 2 3 1 1 3 3 3 2 1 3 3 2 1 2 2 2
hclust.ward.caso3 <- hclust(matriz.dis.euclid.caso3, method = "ward.D2")

grupo.ward <- cutree(hclust.ward.caso3, k = 2, h = NULL)


cuadro.caso3.grupos <- cbind(cuadrot, grupo.ward)


print(cuadro.caso3.grupos)
##               CCAA Automovil TV_color Video Microondas Lavavajillas Telefono
## 1           España      69.0     97.6  62.4       32.3         17.0     85.2
## 2        Andalucía      66.7     98.0  82.7       24.1         12.7     74.7
## 3           Aragón      67.2     97.5  56.8       43.4         20.6     88.4
## 4         Asturias      63.7     95.2  52.1       24.4         13.3     88.1
## 5         Baleares      71.9     98.8  62.4       29.8         10.1     87.9
## 6         Canarias      72.7     96.8  68.4       27.9          5.8     75.4
## 7        Cantabria      63.4     94.9  48.9       36.5         11.2     80.5
## 8  Castilla y León      65.8     97.1  47.7       28.1         14.0     85.0
## 9  Cast.-La Mancha      61.5     97.3  53.6       21.7          7.1     72.9
## 10        Cataluña      70.4     98.1  71.1       36.8         19.8     92.2
## 11 Com. Valenciana      72.7     98.4  68.2       26.6         12.1     84.4
## 12     Extremadura      60.5     97.7  43.7       20.7         11.7     67.1
## 13         Galicia      65.5     91.3  42.7       13.5         14.6     85.9
## 14          Madrid      74.0     99.4  76.3       53.9         32.3     95.7
## 15          Murcia      69.0     98.7  59.3       19.5         12.1     81.4
## 16         Navarra      76.4     99.3  60.6       44.0         20.6     87.4
## 17      País Vasco      71.3     98.3  61.6       45.7         23.7     94.3
## 18        La Rioja      64.9     98.6  54.4       44.4         17.6     83.4
##    grupo.ward
## 1           1
## 2           1
## 3           2
## 4           1
## 5           1
## 6           1
## 7           1
## 8           1
## 9           1
## 10          2
## 11          1
## 12          1
## 13          1
## 14          2
## 15          1
## 16          2
## 17          2
## 18          2
plot(hclust.ward.caso3, labels = cuadrot$CC.AA., 
     main = "Clustering Jerárquico - Método Ward",
     xlab = "Comunidades Autónomas", 
     ylab = "Distancia",
     cex = 0.8)
rect.hclust(hclust.ward.caso3, k = 2, border = c("red", "blue"))

library(dendextend)

dend <- as.dendrogram(hclust.ward.caso3)
labels_colors(dend) <- grupo.ward + 1

plot(dend, 
     main = "Dendrograma - Asignación de Clusters",
     ylab = "Altura",
     cex = 0.7)
legend("topright", 
       legend = paste("Cluster", 1:2), 
       fill = 2:3, 
       cex = 0.8)

library(factoextra)

fviz_cluster(list(data = cuadrot[, -1], cluster = grupo.ward),
             palette = c("#2E9FDF", "#00AFBB"),
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_bw(),
             main = "Visualización PCA - Clusters Ward")