Un ejemplo de aplicación del análisis de conglomrados
Diseño de un plan de incentivos para vendedores
El director de ventas de una cadena de tiendas de electrodomésticos con implantación nacional está estudiando el plan de incentivos de sus vendedores. Considera que los incentivos deben estar ajustados a las dificultades de las distintas zonas de ventas, siendo necesario fijar incentivos más altos en aquellas zonas geográficas en que las condiciones de vida de sus habitantes hacen más difícil las ventas. Por este motivo quiere determinar si las comunidades autónomas se pueden segmentar en grupos homogéneos respecto al equipamiento de los hogares. Para ello dispone de los datos que aparecen en el cuadro 3.22, y el objetivo es establecer cuantas grupos de comunidades autonomas con niveles de equipamiento similar pueden establecerse y en que radican las diferencias entre esos grupos. el procedimiento que aplicaremos es el escrito en el sigui8ente tema a saber.
Análisis de la existencia de outliers en la medida en que pueden generar importantes distorsiones en la detección del número de grupos.
Realización de un análisis de conglomerados jerárquicos, evaluando solución de distintos métodos de conglomeración, aplicando los criterios presentados para identificar el número adecuado de grupos y obtención de los centroides que han de servir de partida para el paso siguiente.
Realización de un análisis de conglomerados no jerárquico mediante el método de k-medias para la obtención de una solución óptima en términos de homogeneidad intrasegmentos y heterogeneidad intersegmentos.
library(kableExtra)
cuadrot <- data.frame(
CCAA = c("España","Andalucía","Aragón","Asturias","Baleares","Canarias","Cantabria",
"Castilla y León","Cast.-La Mancha","Cataluña","Com. Valenciana","Extremadura",
"Galicia","Madrid","Murcia","Navarra","País Vasco","La Rioja"),
Automovil = c(69.0,66.7,67.2,63.7,71.9,72.7,63.4,65.8,61.5,70.4,72.7,60.5,65.5,74.0,69.0,76.4,71.3,64.9),
TV_color = c(97.6,98.0,97.5,95.2,98.8,96.8,94.9,97.1,97.3,98.1,98.4,97.7,91.3,99.4,98.7,99.3,98.3,98.6),
Video = c(62.4,82.7,56.8,52.1,62.4,68.4,48.9,47.7,53.6,71.1,68.2,43.7,42.7,76.3,59.3,60.6,61.6,54.4),
Microondas = c(32.3,24.1,43.4,24.4,29.8,27.9,36.5,28.1,21.7,36.8,26.6,20.7,13.5,53.9,19.5,44.0,45.7,44.4),
Lavavajillas = c(17.0,12.7,20.6,13.3,10.1,5.8,11.2,14.0,7.1,19.8,12.1,11.7,14.6,32.3,12.1,20.6,23.7,17.6),
Telefono = c(85.2,74.7,88.4,88.1,87.9,75.4,80.5,85.0,72.9,92.2,84.4,67.1,85.9,95.7,81.4,87.4,94.3,83.4)
)
cuadrot %>%
kbl(caption = "Equipamiento de los hogares en distintas comunidades autónomas",
align = "lccccccc") %>%
kable_classic(full_width = FALSE, html_font = "Times")
| CCAA | Automovil | TV_color | Video | Microondas | Lavavajillas | Telefono |
|---|---|---|---|---|---|---|
| España | 69.0 | 97.6 | 62.4 | 32.3 | 17.0 | 85.2 |
| Andalucía | 66.7 | 98.0 | 82.7 | 24.1 | 12.7 | 74.7 |
| Aragón | 67.2 | 97.5 | 56.8 | 43.4 | 20.6 | 88.4 |
| Asturias | 63.7 | 95.2 | 52.1 | 24.4 | 13.3 | 88.1 |
| Baleares | 71.9 | 98.8 | 62.4 | 29.8 | 10.1 | 87.9 |
| Canarias | 72.7 | 96.8 | 68.4 | 27.9 | 5.8 | 75.4 |
| Cantabria | 63.4 | 94.9 | 48.9 | 36.5 | 11.2 | 80.5 |
| Castilla y León | 65.8 | 97.1 | 47.7 | 28.1 | 14.0 | 85.0 |
| Cast.-La Mancha | 61.5 | 97.3 | 53.6 | 21.7 | 7.1 | 72.9 |
| Cataluña | 70.4 | 98.1 | 71.1 | 36.8 | 19.8 | 92.2 |
| Com. Valenciana | 72.7 | 98.4 | 68.2 | 26.6 | 12.1 | 84.4 |
| Extremadura | 60.5 | 97.7 | 43.7 | 20.7 | 11.7 | 67.1 |
| Galicia | 65.5 | 91.3 | 42.7 | 13.5 | 14.6 | 85.9 |
| Madrid | 74.0 | 99.4 | 76.3 | 53.9 | 32.3 | 95.7 |
| Murcia | 69.0 | 98.7 | 59.3 | 19.5 | 12.1 | 81.4 |
| Navarra | 76.4 | 99.3 | 60.6 | 44.0 | 20.6 | 87.4 |
| País Vasco | 71.3 | 98.3 | 61.6 | 45.7 | 23.7 | 94.3 |
| La Rioja | 64.9 | 98.6 | 54.4 | 44.4 | 17.6 | 83.4 |
library(kableExtra)
dato23 <- data.frame(
CCAA = c("España","Andalucía","Aragón","Asturias","Baleares","Canarias","Cantabria",
"Castilla y León","Castilla-La Mancha","Cataluña","Com. Valenciana",
"Extremadura","Galicia","Madrid","Murcia","Navarra","País Vasco","La Rioja"),
D2 = c(0.40, 3.93, 1.94, 4.46, 6.02, 10.47, 7.27, 3.25, 4.12, 4.21, 2.85,
0.29, 13.30, 9.49, 4.61, 9.58, 2.55, 4.25),
p_value = c(0.99, 0.68, 0.92, 9.61, 0.42, 0.10, 0.29, 0.77, 0.66, 0.64, 0.82,
0.15, 0.03, 0.14, 0.59, 0.14, 0.86, 0.64)
)
dato23 %>%
kbl(
caption = "Cuadro 3.23: Resultados de la detección de outliers",
col.names = c("CC.AA.", "D²", "p-value (χ²(df = 6))"),
align = "lccc"
) %>%
kable_classic(full_width = FALSE, html_font = "Times")
| CC.AA. | D² | p-value (χ²(df = 6)) |
|---|---|---|
| España | 0.40 | 0.99 |
| Andalucía | 3.93 | 0.68 |
| Aragón | 1.94 | 0.92 |
| Asturias | 4.46 | 9.61 |
| Baleares | 6.02 | 0.42 |
| Canarias | 10.47 | 0.10 |
| Cantabria | 7.27 | 0.29 |
| Castilla y León | 3.25 | 0.77 |
| Castilla-La Mancha | 4.12 | 0.66 |
| Cataluña | 4.21 | 0.64 |
| Com. Valenciana | 2.85 | 0.82 |
| Extremadura | 0.29 | 0.15 |
| Galicia | 13.30 | 0.03 |
| Madrid | 9.49 | 0.14 |
| Murcia | 4.61 | 0.59 |
| Navarra | 9.58 | 0.14 |
| País Vasco | 2.55 | 0.86 |
| La Rioja | 4.25 | 0.64 |
Call: hclust(d = matriz.dis.euclid.caso3, method = “average”)
Cluster method : average Distance : euclidean Number of objects: 18
height merge.1 merge.2
1 6.874591 -3 -18 2 7.153321 -4 -8 3 7.805767 -5 -11 4 9.339233 -1 3
5 9.387225 -16 -17 6 12.380548 1 5
library(NbClust)
Datos.NbClust <- cuadrot[, 2:7]
colnames(Datos.NbClust) <- c("automovil", "tvcolor", "video",
"microondas", "lavavajillas", "telefono")
Datos.NbClust[] <- lapply(Datos.NbClust, \(x) as.numeric(as.character(x)))
res.wardD2 <- NbClust(Datos.NbClust,
distance = "euclidean",
min.nc = 2,
max.nc = 15,
method = "ward.D2",
index = "alllong")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 8 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
print(res.wardD2)
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.3084 10.9552 8.6730 13.6750 182.7994 1.458381e+15 612504.2932 3659.0000
## 3 3.0067 11.9843 3.6644 8.7681 207.2635 8.429416e+14 195076.6371 2372.7967
## 4 0.9890 10.4186 3.3738 8.3192 227.0281 4.998113e+14 162904.8130 1906.9433
## 5 0.7929 9.7875 4.0918 7.8431 248.9889 2.305508e+14 115617.9267 1536.6400
## 6 1.3688 10.2580 3.2810 7.7737 278.7225 6.364045e+13 59789.0250 1168.7690
## 7 1.2121 10.4798 2.9057 7.5738 325.9617 6.278598e+12 37773.1391 917.8223
## 8 1.3307 10.7005 2.3395 7.3293 351.2572 2.011553e+12 26359.7034 726.0390
## 9 1.0311 10.6612 2.3793 6.9147 375.8871 6.480048e+11 18848.5893 588.3850
## 10 0.9984 10.8857 2.6022 6.5456 397.9883 2.343403e+11 10040.9976 465.3583
## 11 1.0601 11.5885 2.8409 6.2961 427.8106 5.408748e+10 7201.2983 351.1417
## 12 1.2541 12.9161 2.6984 6.1704 480.8366 3.382908e+09 3896.7090 249.7733
## 13 1.5600 14.4913 1.9617 5.9377 977.1256 4.200000e-03 1829.0523 172.2883
## 14 1.0507 15.0205 2.2118 5.0933 NaN 0.000000e+00 1084.5981 123.7400
## 15 1.4032 16.3636 1.8571 4.1649 NaN 0.000000e+00 511.7484 79.6800
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale
## 2 7.132820e+03 129.4288 0.3702 1.0867 0.3181 0.5504 8.1671 2.8565
## 3 8.070990e+03 199.5873 0.3956 1.0531 0.3161 4.2684 -3.0629 -2.3568
## 4 8.221908e+03 248.3451 0.6022 0.8698 0.3419 0.5515 3.2527 2.5028
## 5 1.185957e+04 308.1919 0.5579 0.8452 0.3434 8.6970 -3.5401 -2.7240
## 6 1.294684e+04 405.1956 0.5124 0.7586 0.3787 15.7979 -1.8734 -2.4025
## 7 6.141230e+04 515.9823 0.5742 0.6390 0.4212 0.4986 3.0168 2.9016
## 8 6.324076e+04 652.2791 0.6817 0.6654 0.3972 18.0014 -2.8333 -2.7252
## 9 9.074476e+04 804.8813 0.7296 0.6394 0.4261 28.6303 -0.9651 -1.8565
## 10 1.012226e+05 1017.6675 0.6985 0.5764 0.4785 24.9376 -1.9198 -2.4620
## 11 1.125365e+05 1348.6866 0.6335 0.5071 0.5212 45.5972 -0.9781 -1.8815
## 12 2.861114e+05 1896.0393 0.6906 0.4416 0.5749 132.3449 0.0000 0.0000
## 13 4.222924e+15 2748.7645 0.7908 0.3890 0.6565 60.4131 -0.9834 -1.8918
## 14 -1.880118e+15 3827.2188 0.8114 0.3620 0.6838 189.1116 0.0000 0.0000
## 15 1.049247e+16 5943.5248 0.9224 0.2904 0.7667 380.6423 0.0000 0.0000
## Ratkowsky Ball Ptbiserial Gap Frey McClain Gamma Gplus Tau
## 2 0.3969 1829.5000 0.4539 -0.6438 0.4531 0.6077 0.5302 8.9542 20.2092
## 3 0.4340 790.9322 0.5012 -1.1169 0.1493 1.4282 0.6811 5.0654 21.6340
## 4 0.3967 476.7358 0.5195 -1.4087 0.4821 1.5890 0.7456 3.7582 22.0261
## 5 0.3825 307.3280 0.5029 -1.6960 0.1499 2.0181 0.7887 2.6732 19.9608
## 6 0.3590 194.7948 0.5151 -0.1846 0.2784 2.2708 0.8795 1.3399 19.5556
## 7 0.3448 131.1175 0.5091 0.4693 0.8636 2.4952 0.9154 0.8562 18.5229
## 8 0.3307 90.7549 0.4538 0.9413 0.4384 3.3721 0.9210 0.6275 14.6275
## 9 0.3171 65.3761 0.4183 1.1037 0.3168 4.2268 0.9445 0.3529 12.0131
## 10 0.3033 46.5358 0.3994 1.5455 0.2956 4.7948 0.9551 0.2484 10.5621
## 11 0.2912 31.9220 0.3639 2.1145 0.2159 6.0396 0.9645 0.1503 8.1699
## 12 0.2809 20.8144 0.3363 2.5283 0.1518 7.2496 0.9863 0.0458 6.5882
## 13 0.2713 13.2529 0.3218 3.0074 0.4252 7.9794 0.9955 0.0131 5.7386
## 14 0.2631 8.8386 0.2690 3.3806 0.2014 11.4837 0.9966 0.0065 3.8824
## 15 0.2559 5.3120 0.2394 3.8445 0.4367 14.4335 1.0000 0.0000 2.9412
## Dunn Hubert SDindex Dindex SDbw
## 2 0.2865 2e-04 0.2608 13.1959 1.1107
## 3 0.3897 2e-04 0.2129 10.7652 0.7478
## 4 0.4871 3e-04 0.1881 9.4658 0.2643
## 5 0.4871 3e-04 0.2115 8.2952 0.2323
## 6 0.4978 4e-04 0.1814 7.1368 0.1605
## 7 0.5872 4e-04 0.1632 6.2177 0.1032
## 8 0.5566 4e-04 0.2029 5.5115 0.0873
## 9 0.6357 4e-04 0.2023 4.8338 0.0743
## 10 0.6418 4e-04 0.2219 4.1291 0.0581
## 11 0.6418 4e-04 0.2242 3.4737 0.0456
## 12 0.7629 4e-04 0.2223 2.8410 0.0317
## 13 0.9264 4e-04 0.2231 2.1494 0.0177
## 14 0.8976 4e-04 0.3024 1.7345 0.0150
## 15 1.0795 5e-04 0.2987 1.2130 0.0087
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2 0.3979 15.1322 0.0163 0.5234
## 3 0.1924 16.7852 1.0000 0.3500
## 4 0.1924 16.7852 0.0504 0.3758
## 5 0.1924 16.7852 1.0000 -0.7994
## 6 0.0348 55.4763 1.0000 0.3193
## 7 0.1255 20.9054 0.0369 0.6980
## 8 0.1255 20.9054 1.0000 1.1293
## 9 -0.0981 -11.1930 1.0000 1.1200
## 10 0.0348 55.4763 1.0000 1.2283
## 11 -0.0981 -11.1930 1.0000 1.5758
## 12 -0.3211 0.0000 NaN 1.8159
## 13 -0.0981 -11.1930 1.0000 2.1872
## 14 -0.3211 0.0000 NaN 2.5587
## 15 -0.3211 0.0000 NaN 2.7794
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 3.0000 15.0000 3.0000 2.000 13.000 3.00000e+00 3.0
## Value_Index 3.0067 16.3636 5.0086 13.675 496.289 2.72309e+14 417427.7
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.00 1.500000e+01 3.0000 2.0000 15.0000 15.0000 2.0000
## Value_Index 820.35 1.237259e+16 -21.4006 0.3702 0.2904 0.7667 0.5504
## PseudoT2 Beale Ratkowsky Ball PtBiserial Gap Frey
## Number_clusters 2.0000 3.0000 3.000 3.000 4.0000 2.0000 1
## Value_Index 8.1671 -2.3568 0.434 1038.568 0.5195 -0.6438 NA
## McClain Gamma Gplus Tau Dunn Hubert SDindex Dindex
## Number_clusters 2.0000 15 15 4.0000 15.0000 0 7.0000 0
## Value_Index 0.6077 1 0 22.0261 1.0795 0 0.1632 0
## SDbw
## Number_clusters 15.0000
## Value_Index 0.0087
##
## $Best.partition
## [1] 1 1 2 3 1 1 3 3 3 2 1 3 3 2 1 2 2 2
hclust.ward.caso3 <- hclust(matriz.dis.euclid.caso3, method = "ward.D2")
grupo.ward <- cutree(hclust.ward.caso3, k = 2, h = NULL)
cuadro.caso3.grupos <- cbind(cuadrot, grupo.ward)
print(cuadro.caso3.grupos)
## CCAA Automovil TV_color Video Microondas Lavavajillas Telefono
## 1 España 69.0 97.6 62.4 32.3 17.0 85.2
## 2 Andalucía 66.7 98.0 82.7 24.1 12.7 74.7
## 3 Aragón 67.2 97.5 56.8 43.4 20.6 88.4
## 4 Asturias 63.7 95.2 52.1 24.4 13.3 88.1
## 5 Baleares 71.9 98.8 62.4 29.8 10.1 87.9
## 6 Canarias 72.7 96.8 68.4 27.9 5.8 75.4
## 7 Cantabria 63.4 94.9 48.9 36.5 11.2 80.5
## 8 Castilla y León 65.8 97.1 47.7 28.1 14.0 85.0
## 9 Cast.-La Mancha 61.5 97.3 53.6 21.7 7.1 72.9
## 10 Cataluña 70.4 98.1 71.1 36.8 19.8 92.2
## 11 Com. Valenciana 72.7 98.4 68.2 26.6 12.1 84.4
## 12 Extremadura 60.5 97.7 43.7 20.7 11.7 67.1
## 13 Galicia 65.5 91.3 42.7 13.5 14.6 85.9
## 14 Madrid 74.0 99.4 76.3 53.9 32.3 95.7
## 15 Murcia 69.0 98.7 59.3 19.5 12.1 81.4
## 16 Navarra 76.4 99.3 60.6 44.0 20.6 87.4
## 17 País Vasco 71.3 98.3 61.6 45.7 23.7 94.3
## 18 La Rioja 64.9 98.6 54.4 44.4 17.6 83.4
## grupo.ward
## 1 1
## 2 1
## 3 2
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 2
## 11 1
## 12 1
## 13 1
## 14 2
## 15 1
## 16 2
## 17 2
## 18 2
plot(hclust.ward.caso3, labels = cuadrot$CC.AA.,
main = "Clustering Jerárquico - Método Ward",
xlab = "Comunidades Autónomas",
ylab = "Distancia",
cex = 0.8)
rect.hclust(hclust.ward.caso3, k = 2, border = c("red", "blue"))
library(dendextend)
dend <- as.dendrogram(hclust.ward.caso3)
labels_colors(dend) <- grupo.ward + 1
plot(dend,
main = "Dendrograma - Asignación de Clusters",
ylab = "Altura",
cex = 0.7)
legend("topright",
legend = paste("Cluster", 1:2),
fill = 2:3,
cex = 0.8)
library(factoextra)
fviz_cluster(list(data = cuadrot[, -1], cluster = grupo.ward),
palette = c("#2E9FDF", "#00AFBB"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw(),
main = "Visualización PCA - Clusters Ward")